All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.
Just drop the argument.
Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue. Most callers want this
and also pagevec_lookup_tag() already does this.
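A minimal sketch of the resulting calling convention (assuming the combined
effect of this change and the dropped nr_pages argument above; real callers
add their own locking and filtering):

pgoff_t index = 0;
struct pagevec pvec;
int i;

pagevec_init(&pvec, 0);
while (pagevec_lookup(&pvec, mapping, &index)) {
        for (i = 0; i < pagevec_count(&pvec); i++) {
                struct page *page = pvec.pages[i];
                /* ... process page ... */
        }
        pagevec_release(&pvec);
        /* index has already been advanced past the last page returned */
}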
Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rcu_dereference_key() and user_key_payload() are currently being used in
two different, incompatible ways:
(1) As a wrapper to rcu_dereference() - when only the RCU read lock is used
to protect the key.
(2) As a wrapper to rcu_dereference_protected() - when the key semaphore is
used to protect the key and the key may be being modified.
Fix this by splitting both of the key wrappers to produce:
(1) RCU accessors for keys when caller has the key semaphore locked:
dereference_key_locked()
user_key_payload_locked()
(2) RCU accessors for keys when caller holds the RCU read lock:
dereference_key_rcu()
user_key_payload_rcu()
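A minimal usage sketch of the two pairs (a generic example, not the NFS
idmapper code itself):

const struct user_key_payload *payload;

/* Caller holds only the RCU read lock: */
rcu_read_lock();
payload = user_key_payload_rcu(key);
if (payload) {
        /* read payload->data[0 .. payload->datalen - 1] */
}
rcu_read_unlock();

/* Caller holds the key semaphore: */
down_read(&key->sem);
payload = user_key_payload_locked(key);
/* payload is stable while key->sem is held */
up_read(&key->sem);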
This should fix the following warning in the NFS idmapper:
===============================
[ INFO: suspicious RCU usage. ]
4.10.0 #1 Tainted: G W
-------------------------------
./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 0
1 lock held by mount.nfs/5987:
#0: (rcu_read_lock){......}, at: [<d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4]
stack backtrace:
CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G W 4.10.0 #1
Call Trace:
dump_stack+0xe8/0x154 (unreliable)
lockdep_rcu_suspicious+0x140/0x190
nfs_idmap_get_key+0x380/0x420 [nfsv4]
nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4]
decode_getfattr_attrs+0xfac/0x16b0 [nfsv4]
decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4]
nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4]
rpcauth_unwrap_resp+0xe8/0x140 [sunrpc]
call_decode+0x29c/0x910 [sunrpc]
__rpc_execute+0x140/0x8f0 [sunrpc]
rpc_run_task+0x170/0x200 [sunrpc]
nfs4_call_sync_sequence+0x68/0xa0 [nfsv4]
_nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4]
nfs4_lookup_root+0xe0/0x350 [nfsv4]
nfs4_lookup_root_sec+0x70/0xa0 [nfsv4]
nfs4_find_root_sec+0xc4/0x100 [nfsv4]
nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4]
nfs4_get_rootfh+0x6c/0x190 [nfsv4]
nfs4_server_common_setup+0xc4/0x260 [nfsv4]
nfs4_create_server+0x278/0x3c0 [nfsv4]
nfs4_remote_mount+0x50/0xb0 [nfsv4]
mount_fs+0x74/0x210
vfs_kern_mount+0x78/0x220
nfs_do_root_mount+0xb0/0x140 [nfsv4]
nfs4_try_mount+0x60/0x100 [nfsv4]
nfs_fs_mount+0x5ec/0xda0 [nfs]
mount_fs+0x74/0x210
vfs_kern_mount+0x78/0x220
do_mount+0x254/0xf70
SyS_mount+0x94/0x100
system_call+0x38/0xe0
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Under some circumstances, an fscache object can become queued such that
fscache_object_work_func() can be called once the object is in the
OBJECT_DEAD state. This results in the kernel oopsing when it tries to
invoke the handler for that state (which is hard-coded to 0x2).
The way this comes about is something like the following:
(1) The object dispatcher is processing a work state for an object. This
is done in workqueue context.
(2) An out-of-band event comes in that isn't masked, causing the object to
be queued, say EV_KILL.
(3) The object dispatcher finishes processing the current work state on
that object and then sees there's another event to process, so,
without returning to the workqueue core, it processes that event too.
It then follows the chain of events that initiates until we reach
OBJECT_DEAD without going through a wait state (such as
WAIT_FOR_CLEARANCE).
At this point, object->events may be 0, object->event_mask will be 0
and oob_event_mask will be 0.
(4) The object dispatcher returns to the workqueue processor, and in due
course, this sees that the object's work item is still queued and
invokes it again.
(5) The current state is a work state (OBJECT_DEAD), so the dispatcher
jumps to it - resulting in an OOPS.
When I'm seeing this, the work state in (1) appears to have been either
LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is
fscache_osm_lookup_oob).
The window for (2) is very small:
(A) object->event_mask is cleared whilst the event dispatch process is
underway - though there's no memory barrier to force this to the top
of the function.
The window, therefore, is from the time the object was selected by the
workqueue processor and made requeueable to the time the mask was
cleared.
(B) fscache_raise_event() will only queue the object if it manages to set
the event bit and the corresponding event_mask bit was set.
The enqueuement is then deferred slightly whilst we get a ref on the
object and get the per-CPU variable for workqueue congestion. This
deferral slightly increases the probability of the race by allowing extra
time for the workqueue to make the item requeueable.
Handle this by giving the dead state a processor function and checking for
the dead state address rather than seeing if the processor function is
address 0x2. The dead state processor function can then set a flag to
indicate that it's occurred and give a warning if it occurs more than once
per object.
If this race occurs, an oops similar to the following is seen (note the RIP
value):
BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
IP: [<0000000000000002>] 0x1
PGD 0
Oops: 0010 [#1] SMP
Modules linked in: ...
CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
Workqueue: fscache_object fscache_object_work_func [fscache]
task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000
RIP: 0010:[<0000000000000002>] [<0000000000000002>] 0x1
RSP: 0018:ffff880717547df8 EFLAGS: 00010202
RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200
RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480
RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510
R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000
R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600
FS: 0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00
ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000
ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900
Call Trace:
[<ffffffffa0363695>] ? fscache_object_work_func+0xa5/0x200 [fscache]
[<ffffffff8109d5db>] process_one_work+0x17b/0x470
[<ffffffff8109e4ac>] worker_thread+0x21c/0x400
[<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
[<ffffffff810a5acf>] kthread+0xcf/0xe0
[<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[<ffffffff816460d8>] ret_from_fork+0x58/0x90
[<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeremy McNicoll <jeremymc@redhat.com>
Tested-by: Frank Sorenson <sorenson@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fscache_disable_cookie() needs to clear the outstanding writes on the
cookie it's disabling because they cannot be completed afterwards.
Without this, nfs_fscache_open_file() gets stuck because it disables the
cookie when the file is opened for writing but can't uncache the pages till
afterwards - otherwise there's a race between the open routine and anyone
who already has it open R/O and is still reading from it.
Looking in /proc/pid/stack of the offending process shows:
[<ffffffffa0142883>] __fscache_wait_on_page_write+0x82/0x9b [fscache]
[<ffffffffa014336e>] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache]
[<ffffffffa01740fa>] nfs_fscache_open_file+0x59/0x9e [nfs]
[<ffffffffa01ccf41>] nfs4_file_open+0x17f/0x1b8 [nfsv4]
[<ffffffff8117350e>] do_dentry_open+0x16d/0x2b7
[<ffffffff811743ac>] vfs_open+0x5c/0x65
[<ffffffff81184185>] path_openat+0x785/0x8fb
[<ffffffff81184343>] do_filp_open+0x48/0x9e
[<ffffffff81174710>] do_sys_open+0x13b/0x1cb
[<ffffffff811747b9>] SyS_open+0x19/0x1b
[<ffffffff81001c44>] do_syscall_64+0x80/0x17a
[<ffffffff8165c2da>] return_from_SYSCALL_64+0x0/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff
Reported-by: Jianhong Yin <jiyin@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Initialise the stores_lock in fscache netfs cookies. Technically, it
shouldn't be necessary, since the netfs cookie is an index and stores no
data, but initialising it anyway adds insignificant overhead.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's not needed for the file_operations of inodes located on a filesystem
defined in the hosting module, nor for file_operations that go into procfs.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with chunks bigger than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE, and it's a constant source of confusion as to whether
PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch on them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
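As an illustration, the script turns code like the following (a
hypothetical snippet, not taken from any particular file):

index  = pos >> PAGE_CACHE_SHIFT;
offset = pos & ~PAGE_CACHE_MASK;
page_cache_get(page);
page_cache_release(page);

into:

index  = pos >> PAGE_SHIFT;
offset = pos & ~PAGE_MASK;
get_page(page);
put_page(page);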
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Handle a write being requested to the page immediately beyond the EOF
marker on a cache object. Currently this gets an assertion failure in
CacheFiles because the EOF marker is used there to encode information about
a partial page at the EOF - which could lead to an unknown blank spot in
the file if we extend the file over it.
The problem is actually in fscache where we check the index of the page
being written against store_limit. store_limit is set to the number of
pages that we're allowed to store by fscache_set_store_limit() - which
means it's one more than the index of the last page we're allowed to store.
The problem is that we permit writing to a page with an index _equal_ to
the store limit - when we should reject that case.
Whilst we're at it, change the triggered assertion in CacheFiles to just
return -ENOBUFS instead.
The assertion failure looks something like this:
CacheFiles: Assertion failed
1000 < 7b1 is false
------------[ cut here ]------------
kernel BUG at fs/cachefiles/rdwr.c:962!
...
RIP: 0010:[<ffffffffa02c9e83>] [<ffffffffa02c9e83>] cachefiles_write_page+0x273/0x2d0 [cachefiles]
Cc: stable@vger.kernel.org # v2.6.31+; earlier - that + backport of a17754f (at least)
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only override netfs->primary_index when registration succeeds.
Cc: stable@vger.kernel.org # v2.6.30+
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If the netfs already exists, fscache should not increase the parent's usage
and n_children counts; otherwise they will never be decreased.
v2: following David's suggestion,
move the increase of the parent's reference to the success path
use kmem_cache_free() to free primary_index directly
v3: don't move "netfs->primary_index->parent = &fscache_fsdef_index;"
Cc: stable@vger.kernel.org # v2.6.30+
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access to one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimistic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
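As a rough sketch of the preferred non-blocking check described above
(a generic caller, not one of the sites converted here; my_lock is a
stand-in):

if (gfpflags_allow_blocking(gfp_mask)) {
        /* __GFP_DIRECT_RECLAIM is set: we are allowed to sleep */
        mutex_lock(&my_lock);
} else {
        /* atomic caller: must not sleep, so try and fall back */
        if (!mutex_trylock(&my_lock))
                return -EAGAIN;
}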
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the retrieval operation may be disposed of by fscache_put_operation()
before we actually set the context, the retrieval-specific cleanup operation
can produce a NULL-pointer dereference when it tries to unconditionally clean
up the netfs context.
Given that it is expected that we'll get at least as far as the place where we
currently set the context pointer and it is unlikely we'll go through the
error handling paths prior to that point, retain the context right from the
point that the retrieval op is allocated.
Concomitant to this, we need to retain the cookie pointer in the retrieval op
also so that we can call the netfs to release its context in the release
method.
In addition, we might now get into fscache_release_retrieval_op() with the op
only initialised. To this end, set the operation to DEAD only after the
release method has been called and skip the n_pages test upon cleanup if the
op is still in the INITIALISED state.
Without these changes, the following oops might be seen:
BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
...
RIP: 0010:[<ffffffffa0089c98>] fscache_release_retrieval_op+0xae/0x100
...
Call Trace:
[<ffffffffa0088560>] fscache_put_operation+0x117/0x2e0
[<ffffffffa008b8f5>] __fscache_read_or_alloc_pages+0x351/0x3ac
[<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
[<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs]
[<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e
[<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a
[<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c
[<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af
[<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a
[<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a
[<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
[<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs]
[<ffffffff811363be>] new_sync_read+0x78/0x9c
[<ffffffff81137164>] __vfs_read+0x13/0x38
[<ffffffff8113721e>] vfs_read+0x95/0x121
[<ffffffff811372f6>] SyS_read+0x4c/0x8a
[<ffffffff81557a52>] system_call_fastpath+0x12/0x17
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Any time an incomplete operation is cancelled, the operation cancellation
function needs to be called to clean up. This is currently being passed
directly to some of the functions that might want to call it, but not all.
Instead, pass the cancellation method pointer to the fscache_operation_init()
and have that cache it in the operation struct. Further, plug in a dummy
cancellation handler if the caller declines to set one as this allows us to
call the function unconditionally (the extra overhead isn't worth bothering
about as we don't expect to be calling this typically).
The cancellation method must thence be called everywhere the CANCELLED state
is set. Note that we call it *before* setting the CANCELLED state such that
the method can use the old state value to guide its operation.
fscache_do_cancel_retrieval() needs moving higher up in the sources so that
the init function can use it now.
Without this, the following oops may be seen:
FS-Cache: Assertion failed
FS-Cache: 3 == 0 is false
------------[ cut here ]------------
kernel BUG at ../fs/fscache/page.c:261!
...
RIP: 0010:[<ffffffffa0089c1b>] fscache_release_retrieval_op+0x77/0x100
[<ffffffffa008853d>] fscache_put_operation+0x114/0x2da
[<ffffffffa008b8c2>] __fscache_read_or_alloc_pages+0x358/0x3b3
[<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
[<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs]
[<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e
[<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a
[<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c
[<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af
[<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a
[<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a
[<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
[<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs]
[<ffffffff811363be>] new_sync_read+0x78/0x9c
[<ffffffff81137164>] __vfs_read+0x13/0x38
[<ffffffff8113721e>] vfs_read+0x95/0x121
[<ffffffff811372f6>] SyS_read+0x4c/0x8a
[<ffffffff81557a52>] system_call_fastpath+0x12/0x17
The assertion is showing that the remaining number of pages (n_pages) is not 0
when the operation is being released.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Call fscache_put_operation() or a wrapper on any op that has gone through
fscache_operation_init() so that the accounting shown in /proc is done
correctly, specifically fscache_n_op_release.
fscache_put_operation() therefore now allows an op in the INITIALISED state as
well as in the CANCELLED and COMPLETE states.
Note that this means that an operation can get put that doesn't have its
->object pointer filled in, so anything that depends on the object needs to be
conditional in fscache_put_operation().
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Cancellation of an in-progress operation needs to update the relevant counters
and start any operations that are pending waiting on this one.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Count and display through /proc/fs/fscache/stats the number of initialised
operations.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Out of line fscache_operation_init() so that it can access internal FS-Cache
features, such as stats, in a later commit.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Currently, fscache_cancel_op() only cancels pending operations - attempts to
cancel in-progress operations are ignored. This leads to a problem in
fscache_wait_for_operation_activation() whereby the wait is terminated, but
the object has been killed.
The check at the end of the function now triggers because it's no longer
contingent on the cache having produced an I/O error since the commit that
fixed the logic error in fscache_object_is_dead().
The result of the check is that it tries to cancel the operation - but since
the operation may no longer be pending by this point, the cancellation
request may be ignored - with the result that the object is just put by the
caller and
fscache_put_operation has an assertion failure because the operation isn't in
either the COMPLETE or the CANCELLED states.
To fix this, we permit in-progress ops to be cancelled under some
circumstances.
The bug results in an oops that looks something like this:
FS-Cache: fscache_wait_for_operation_activation() = -ENOBUFS [obj dead 3]
FS-Cache:
FS-Cache: Assertion failed
FS-Cache: 3 == 5 is false
------------[ cut here ]------------
kernel BUG at ../fs/fscache/operation.c:432!
...
RIP: 0010:[<ffffffffa0088574>] fscache_put_operation+0xf2/0x2cd
Call Trace:
[<ffffffffa008b92a>] __fscache_read_or_alloc_pages+0x2ec/0x3b3
[<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs]
[<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs]
[<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e
[<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a
[<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c
[<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af
[<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a
[<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a
[<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs]
[<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs]
[<ffffffff811363be>] new_sync_read+0x78/0x9c
[<ffffffff81137164>] __vfs_read+0x13/0x38
[<ffffffff8113721e>] vfs_read+0x95/0x121
[<ffffffff811372f6>] SyS_read+0x4c/0x8a
[<ffffffff81557a52>] system_call_fastpath+0x12/0x17
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
fscache_object_is_dead() returns true only if the object is marked dead and
the cache got an I/O error. This should be a logical OR instead. Since two
of the callers got split up into handling for separate subcases, expand the
other callers and kill the function. This is probably the right thing to do
anyway since one of the subcases isn't about the object at all, but rather
about the cache.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
When an object is being marked as no longer live, do this under the object
spinlock to prevent a race with operation submission targeted on that object.
The problem occurs due to the following pair of intertwined sequences when the
cache tries to create an object that would take it over the hard available
space limit:
NETFS INTERFACE
===============
(A) The netfs calls fscache_acquire_cookie(). object creation is deferred to
the object state machine and the netfs is allowed to continue.
OBJECT STATE MACHINE KTHREAD
============================
(1) The object is looked up on disk by fscache_look_up_object()
calling cachefiles_walk_to_object(). The latter finds that the
object is not yet represented on disk and calls
fscache_object_lookup_negative().
(2) fscache_object_lookup_negative() sets FSCACHE_COOKIE_NO_DATA_YET
and clears FSCACHE_COOKIE_LOOKING_UP, thus allowing the netfs to
start queuing read operations.
(B) The netfs calls fscache_read_or_alloc_pages(). This calls
fscache_wait_for_deferred_lookup() which sees FSCACHE_COOKIE_LOOKING_UP
become clear, allowing the read to begin.
(C) A read operation is set up and passed to fscache_submit_op() to deal
with.
(3) cachefiles_walk_to_object() calls cachefiles_has_space(), which
fails (or one of the file operations to create stuff fails).
cachefiles returns an error to fscache.
(4) fscache_look_up_object() transits to the LOOKUP_FAILURE state,
(5) fscache_lookup_failure() sets FSCACHE_OBJECT_LOOKED_UP and
FSCACHE_COOKIE_UNAVAILABLE and clears FSCACHE_COOKIE_LOOKING_UP
then transits to the KILL_OBJECT state.
(6) fscache_kill_object() clears FSCACHE_OBJECT_IS_LIVE in an attempt
to reject any further requests from the netfs.
(7) object->n_ops is examined and found to be 0.
fscache_kill_object() transits to the DROP_OBJECT state.
(D) fscache_submit_op() locks the object spinlock, sees if it can dispatch
the op immediately by calling fscache_object_is_active() - which fails
since FSCACHE_OBJECT_IS_AVAILABLE has not yet been set.
(E) fscache_submit_op() then tests FSCACHE_OBJECT_LOOKED_UP - which is set.
It then queues the object and increments object->n_ops.
(8) fscache_drop_object() releases the object and eventually
fscache_put_object() calls cachefiles_put_object() which suffers
an assertion failure here:
ASSERTCMP(object->fscache.n_ops, ==, 0);
Locking the object spinlock in step (6) around the clearance of
FSCACHE_OBJECT_IS_LIVE ensures that the decision trees in
fscache_submit_op() and fscache_submit_exclusive_op() don't see the IS_LIVE
flag being cleared mid-decision: either the op is queued before step (7) - in
which case fscache_kill_object() will see n_ops>0 and will deal with the op -
or the op will be rejected.
This, combined with rejecting op submission if the target object is dying, fixes
the problem.
The problem shows up as the following oops:
CacheFiles: Assertion failed
CacheFiles: 1 == 0 is false
------------[ cut here ]------------
kernel BUG at ../fs/cachefiles/interface.c:339!
...
RIP: 0010:[<ffffffffa014fd9c>] [<ffffffffa014fd9c>] cachefiles_put_object+0x2a4/0x301 [cachefiles]
...
Call Trace:
[<ffffffffa008674b>] fscache_put_object+0x18/0x21 [fscache]
[<ffffffffa00883e6>] fscache_object_work_func+0x3ba/0x3c9 [fscache]
[<ffffffff81054dad>] process_one_work+0x226/0x441
[<ffffffff81055d91>] worker_thread+0x273/0x36b
[<ffffffff81055b1e>] ? rescuer_thread+0x2e1/0x2e1
[<ffffffff81059b9d>] kthread+0x10e/0x116
[<ffffffff81059a8f>] ? kthread_create_on_node+0x1bb/0x1bb
[<ffffffff815579ac>] ret_from_fork+0x7c/0xb0
[<ffffffff81059a8f>] ? kthread_create_on_node+0x1bb/0x1bb
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Reject new operations that are being submitted against an object if that
object has failed its lookup or creation states or has been killed by the
cache backend for some other reason, such as having been culled.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
When submitting an operation, prefer to cancel the operation immediately
rather than queuing it for later processing if the object is marked as dying
(ie. the object state machine has reached the KILL_OBJECT state).
Whilst we're at it, change the series of related test_bit() calls into a
READ_ONCE() and bitwise-AND operators to reduce the number of load
instructions (test_bit() has a volatile address).
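Roughly, the change described replaces something of this shape (the flag
names here are illustrative of the fscache object flags of this era):

if (test_bit(FSCACHE_OBJECT_IS_AVAILABLE, &object->flags))
        /* dispatch the op now */;
else if (test_bit(FSCACHE_OBJECT_IS_LOOKED_UP, &object->flags))
        /* queue the op for later */;

with a single load of the flags word:

unsigned long flags = READ_ONCE(object->flags);

if (flags & BIT(FSCACHE_OBJECT_IS_AVAILABLE))
        /* dispatch the op now */;
else if (flags & BIT(FSCACHE_OBJECT_IS_LOOKED_UP))
        /* queue the op for later */;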
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Move fscache_report_unexpected_submission() up within operation.c so that it
can be called from fscache_submit_exclusive_op() too.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Count the number of objects that get culled by the cache backend and the
number of objects that the cache backend declines to instantiate due to lack
of space in the cache.
These numbers are made available through /proc/fs/fscache/stats
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Reduce boilerplate code by using __seq_open_private() instead of seq_open()
in fscache_objlist_open().
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
I've been seeing issues with disposing cookies under vma pressure. The symptom
is that the refcount gets out of sync. In this case we fail to decrement the
refcount if submit fails. I found this while auditing the error handling in and around
cookie operations.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
fscache_sysctls and fscache_sysctls_root are only used in main.c
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
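A minimal sketch of the new calling convention (the flags word and bit
number are hypothetical):

err = wait_on_bit(&my_flags, MY_PENDING_BIT, TASK_INTERRUPTIBLE);
if (err)
        return -ERESTARTSYS;    /* non-zero means a signal was pending */

and the I/O flavour, which waits via io_schedule() internally:

wait_on_bit_io(&my_flags, MY_PENDING_BIT, TASK_UNINTERRUPTIBLE);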
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since the first version of this patch (against 3.15), two new action
functions have appeared, one in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer-aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace seq_printf() where possible and coalesce formats from 2 existing
seq_puts() calls.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When FS-Cache allocates an object, the following sequence of events can
occur:
-->fscache_alloc_object()
-->cachefiles_alloc_object() [via cache->ops->alloc_object]
<--[returns new object]
-->fscache_attach_object()
<--[failed]
-->cachefiles_put_object() [via cache->ops->put_object]
-->fscache_object_destroy()
-->fscache_objlist_remove()
-->rb_erase() to remove the object from fscache_object_list.
resulting in a crash in the rbtree code.
The problem is that the object is only added to fscache_object_list on
the success path of fscache_attach_object() where it calls
fscache_objlist_add().
So if fscache_attach_object() fails, the object won't have been added to
the objlist rbtree. We do, however, unconditionally try to remove the
object from the tree.
Thanks to NeilBrown for finding this and suggesting this solution.
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: (a customer of) NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block IO core updates from Jens Axboe:
"This is the pull request for the core changes in the block layer for
3.13. It contains:
- The new blk-mq request interface.
This is a new and more scalable queueing model that marries the
best part of the request based interface we currently have (which
is fully featured, but scales poorly) and the bio based "interface"
which the new drivers for high IOPS devices end up using because
it's much faster than the request based one.
The bio interface has no block layer support, since it taps into
the stack much earlier. This means that drivers end up having to
implement a lot of functionality on their own, like tagging,
timeout handling, requeue, etc. The blk-mq interface provides all
these. Some drivers even provide a switch to select bio or rq and
have code to handle both, since things like merging only work in
the rq model and hence it is faster for some workloads. This is a
huge mess. Conversion of these drivers nets us a substantial code
reduction. Initial results on converting SCSI to this model even
shows an 8x improvement on single queue devices. So while the
model was intended to work on the newer multiqueue devices, it has
substantial improvements for "classic" hardware as well. This code
has gone through extensive testing and development, it's now ready
to go. A pull request to convert virtio-blk to this
model will be coming as well, with more drivers scheduled
for 3.14 conversion.
- Two blktrace fixes from Jan and Chen Gang.
- A plug merge fix from Alireza Haghdoost.
- Conversion of __get_cpu_var() from Christoph Lameter.
- Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.
- A fix for a race between request completion and the timeout
handling from Jeff Moyer. This is what caused the merge conflict
with blk-mq/core, in case you are looking at that.
- A dm stacking fix from Mike Snitzer.
- A code consolidation fix and duplicated code removal from Kent
Overstreet.
- A handful of block bug fixes from Mikulas Patocka, fixing a loop
crash and memory corruption on blk cg.
- Elevator switch bug fix from Tomoki Sekiyama.
A heads-up that I had to rebase this branch. Initially the immutable
bio_vecs had been queued up for inclusion, but a week later, it became
clear that it wasn't fully cooked yet. So the decision was made to
pull this out and postpone it until 3.14. It was a straightforward
rebase, just pruning out the immutable series and the later fixes of
problems with it. The rest of the patches applied directly and no
further changes were made"
* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: Do not call sector_div() with a 64-bit divisor
kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
block: Consolidate duplicated bio_trim() implementations
block: Use rw_copy_check_uvector()
block: Enable sysfs nomerge control for I/O requests in the plug list
block: properly stack underlying max_segment_size to DM device
elevator: acquire q->sysfs_lock in elevator_change()
elevator: Fix a race in elevator switching and md device initialization
block: Replace __get_cpu_var uses
bdi: test bdi_init failure
block: fix a probe argument to blk_register_region
loop: fix crash if blk_alloc_queue fails
blk-core: Fix memory corruption if blkcg_init_queue fails
block: fix race between request completion and timeout handling
blktrace: Send BLK_TN_PROCESS events to all running traces
blk-mq: don't disallow request merges for req->special being set
blk-mq: mq plug list breakage
blk-mq: fix for flush deadlock
...
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as :
#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.
The patch set includes passes over all arches as well. Once these operations
are used throughout, specialized macros can be defined in non-x86
arches as well in order to optimize per cpu access by, e.g., using a global
register that may be set to the per cpu base.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
this_cpu_inc(y)
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provide the ability to enable and disable fscache cookies. A disabled cookie
will reject or ignore further requests to:
Acquire a child cookie
Invalidate and update backing objects
Check the consistency of a backing object
Allocate storage for backing page
Read backing pages
Write to backing pages
but still allows:
Checks/waits on the completion of already in-progress objects
Uncaching of pages
Relinquishment of cookies
Two new operations are provided:
(1) Disable a cookie:
void fscache_disable_cookie(struct fscache_cookie *cookie,
bool invalidate);
If the cookie is not already disabled, this locks the cookie against other
dis/enablement ops, marks the cookie as being disabled, discards or
invalidates any backing objects and waits for cessation of activity on any
associated object.
This is a wrapper around a chunk split out of fscache_relinquish_cookie(),
but it reinitialises the cookie such that it can be reenabled.
All possible failures are handled internally. The caller should consider
calling fscache_uncache_all_inode_pages() afterwards to make sure all page
markings are cleared up.
(2) Enable a cookie:
void fscache_enable_cookie(struct fscache_cookie *cookie,
bool (*can_enable)(void *data),
void *data)
If the cookie is not already enabled, this locks the cookie against other
dis/enablement ops, invokes can_enable() and, if the cookie is not an
index cookie, will begin the procedure of acquiring backing objects.
The optional can_enable() function is passed the data argument and returns
a ruling as to whether or not enablement should actually be permitted to
begin.
All possible failures are handled internally. The cookie will only be
marked as enabled if provisional backing objects are allocated.
A later patch will introduce these to NFS. Cookie enablement during nfs_open()
is then contingent on i_writecount <= 0. can_enable() checks for a race
between open(O_RDONLY) and open(O_WRONLY/O_RDWR). This simplifies NFS's cookie
handling and allows us to get rid of open(O_RDONLY) accidentally introducing
caching to an inode that's open for writing already.
One operation has its API modified:
(3) Acquire a cookie.
struct fscache_cookie *fscache_acquire_cookie(
struct fscache_cookie *parent,
const struct fscache_cookie_def *def,
void *netfs_data,
bool enable);
This now has an additional argument that indicates whether the requested
cookie should be enabled by default. It doesn't need the can_enable()
function because the caller must prevent multiple calls for the same netfs
object and it doesn't need to take the enablement lock because no one else
can get at the cookie before this returns.
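A minimal usage sketch tying the three operations together (the cookie
definition, parent cookie, netfs object and ruling function names are
hypothetical):

/* Acquire the cookie, initially disabled: */
cookie = fscache_acquire_cookie(parent_cookie, &my_cookie_def,
                                my_netfs_object, false);

/* Enable it later, subject to an optional ruling function: */
fscache_enable_cookie(cookie, my_can_enable, my_netfs_object);

/* Disable it again, invalidating any backing objects: */
fscache_disable_cookie(cookie, true);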
Signed-off-by: David Howells <dhowells@redhat.com>
Add wrapper functions for dealing with cookie->n_active:
(*) __fscache_use_cookie() to increment it.
(*) __fscache_unuse_cookie() to decrement and test against zero.
(*) __fscache_wake_unused_cookie() to wake up anyone waiting for it to reach
zero.
The second and third are split so that the third can be done after cookie->lock
has been released in case the waiter wakes up whilst we're still holding it and
tries to get it.
We will need to wake-on-zero once the cookie disablement patch is applied
because it will then be possible to see n_active become zero without the cookie
being relinquished.
Also move the cookie usage out of fscache_attr_changed_op() and into
fscache_attr_changed() and the operation struct so that cookie disablement
will be able to track it.
Whilst we're at it, only increment n_active if we're about to do
fscache_submit_op() so that we don't have to deal with undoing it if anything
earlier fails. Possibly this should be moved into fscache_submit_op() which
could look at FSCACHE_OP_UNUSE_COOKIE.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull ceph fixes from Sage Weil:
"These fix several bugs with RBD from 3.11 that didn't get tested in
time for the merge window: some error handling, a use-after-free, and
a sequencing issue when unmapping an image races with a notify
operation.
There is also a patch fixing a problem with the new ceph + fscache
code that just went in"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
fscache: check consistency does not decrement refcount
rbd: fix error handling from rbd_snap_name()
rbd: ignore unmapped snapshots that no longer exist
rbd: fix use-after free of rbd_dev->disk
rbd: make rbd_obj_notify_ack() synchronous
rbd: complete notifies before cleaning up osd_client and rbd_dev
libceph: add function to ensure notifies are complete
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
And we give out one radix tree node twice. That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.
We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt. Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.
Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes. However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed. To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.
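A minimal sketch of the intended use of the new helper (a caller whose
gfp_mask may or may not allow sleeping and which can cope with the
insertion failing; tree, tree_lock, index and item are placeholders):

err = radix_tree_maybe_preload(gfp_mask);
if (err)
        return err;     /* only possible when gfp_mask allowed sleeping */
spin_lock(&tree_lock);
err = radix_tree_insert(&tree, index, item);
spin_unlock(&tree_lock);
radix_tree_preload_end();
if (err)
        /* -ENOMEM: nothing was preloaded, caller must handle it */
        goto failed;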
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__fscache_check_consistency() does not decrement the count of operations
active after it finishes in the success case. This leads to hung tasks on
cookie de-registration (commonly in inode eviction).
INFO: task kworker/1:2:4214 blocked for more than 120 seconds.
kworker/1:2 D ffff880443513fc0 0 4214 2 0x00000000
Workqueue: ceph-msgr con_work [libceph]
...
Call Trace:
[<ffffffff81569fc6>] ? _raw_spin_unlock_irqrestore+0x16/0x20
[<ffffffffa0016570>] ? fscache_wait_bit_interruptible+0x30/0x30 [fscache]
[<ffffffff81568d09>] schedule+0x29/0x70
[<ffffffffa001657e>] fscache_wait_atomic_t+0xe/0x20 [fscache]
[<ffffffff815665cf>] out_of_line_wait_on_atomic_t+0x9f/0xe0
[<ffffffff81083560>] ? autoremove_wake_function+0x40/0x40
[<ffffffffa0015a9c>] __fscache_relinquish_cookie+0x15c/0x310 [fscache]
[<ffffffffa00a4fae>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph]
[<ffffffffa007e373>] ceph_destroy_inode+0x33/0x200 [ceph]
[<ffffffff811c13ae>] ? __fsnotify_inode_delete+0xe/0x10
[<ffffffff8119ba1c>] destroy_inode+0x3c/0x70
[<ffffffff8119bb69>] evict+0x119/0x1b0
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Currently the fscache code expects the netfs to call
fscache_read_or_alloc_pages() inside the aops readpages callback. It marks
all the pages in the list provided by readahead with PG_private_2. In
cases where the netfs fails to
read all the pages (which is legal) it ends up returning to the readahead and
triggering a BUG. This happens because the page list still contains marked
pages.
This patch implements a simple fscache_readpages_cancel function that the netfs
should call before returning from readpages. It will revoke the pages from the
underlying cache backend and unmark them.
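A minimal sketch of the intended use in a netfs ->readpages() handler
(heavily abbreviated; the completion callback, context and read helper
names are placeholders for whatever the netfs actually does):

ret = fscache_read_or_alloc_pages(cookie, mapping, pages, &nr_pages,
                                  my_read_complete, my_context,
                                  mapping_gfp_mask(mapping));

/* nr_pages pages were left for the netfs to read itself */
if (my_read_remaining_pages(pages, nr_pages) < 0) {
        /* can't finish: unmark and release what fscache marked */
        fscache_readpages_cancel(cookie, pages);
        return -EIO;
}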
The problem was originally worked out in the Ceph devel tree, but it also
occurs in CIFS. It appears that NFS, AFS and 9P are okay as read_cache_pages()
will clean up the unprocessed pages in the case of an error.
This can be used to address the following oops:
[12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e
[12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping:
(null) index:0x0
[12410647.597298] page flags: 0x200000000001000(private_2)
...
[12410647.597334] Call Trace:
[12410647.597345] [<ffffffff815523f2>] dump_stack+0x19/0x1b
[12410647.597356] [<ffffffff8111def7>] bad_page+0xc7/0x120
[12410647.597359] [<ffffffff8111e49e>] free_pages_prepare+0x10e/0x120
[12410647.597361] [<ffffffff8111fc80>] free_hot_cold_page+0x40/0x170
[12410647.597363] [<ffffffff81123507>] __put_single_page+0x27/0x30
[12410647.597365] [<ffffffff81123df5>] put_page+0x25/0x40
[12410647.597376] [<ffffffffa02bdcf9>] ceph_readpages+0x2e9/0x6e0 [ceph]
[12410647.597379] [<ffffffff81122a8f>] __do_page_cache_readahead+0x1af/0x260
[12410647.597382] [<ffffffff81122ea1>] ra_submit+0x21/0x30
[12410647.597384] [<ffffffff81118f64>] filemap_fault+0x254/0x490
[12410647.597387] [<ffffffff8113a74f>] __do_fault+0x6f/0x4e0
[12410647.597391] [<ffffffff810125bd>] ? __switch_to+0x16d/0x4a0
[12410647.597395] [<ffffffff810865ba>] ? finish_task_switch+0x5a/0xc0
[12410647.597398] [<ffffffff8113d856>] handle_pte_fault+0xf6/0x930
[12410647.597401] [<ffffffff81008c33>] ? pte_mfn_to_pfn+0x93/0x110
[12410647.597403] [<ffffffff81008cce>] ? xen_pmd_val+0xe/0x10
[12410647.597405] [<ffffffff81005469>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[12410647.597407] [<ffffffff8113f361>] handle_mm_fault+0x251/0x370
[12410647.597411] [<ffffffff812b0ac4>] ? call_rwsem_down_read_failed+0x14/0x30
[12410647.597414] [<ffffffff8155bffa>] __do_page_fault+0x1aa/0x550
[12410647.597418] [<ffffffff8108011d>] ? up_write+0x1d/0x20
[12410647.597422] [<ffffffff8113141c>] ? vm_mmap_pgoff+0xbc/0xe0
[12410647.597425] [<ffffffff81143bb8>] ? SyS_mmap_pgoff+0xd8/0x240
[12410647.597427] [<ffffffff8155c3ae>] do_page_fault+0xe/0x10
[12410647.597431] [<ffffffff81558818>] page_fault+0x28/0x30
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Extend the fscache netfs API so that the netfs can ask as to whether a cache
object is up to date with respect to its corresponding netfs object:
int fscache_check_consistency(struct fscache_cookie *cookie)
This will call back to the netfs to check whether the auxiliary data associated
with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it
may also return -ENOMEM and -ERESTARTSYS.
The backends now have to implement a mandatory operation pointer:
int (*check_consistency)(struct fscache_object *object)
that corresponds to the above API call. FS-Cache takes care of pinning the
object and the cookie in memory and managing this call with respect to the
object state.
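A minimal sketch of the netfs-facing side (where exactly the netfs chooses
to call this is up to it):

ret = fscache_check_consistency(cookie);
if (ret == -ESTALE) {
        /* auxiliary data mismatch: stop trusting the cache object */
        fscache_relinquish_cookie(cookie, 1);
} else if (ret < 0) {
        /* -ENOMEM or -ERESTARTSYS: try again later */
}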
Original-author: Hongyi Jia <jiayisuse@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hongyi Jia <jiayisuse@gmail.com>
cc: Milosz Tanski <milosz@adfin.com>
Under certain circumstances, spin_is_locked() is hardwired to 0 - even when the
code would normally be in a locked section where it should return 1. This
means it cannot be used for an assertion that checks that a spinlock is locked.
Remove such usages from FS-Cache.
The following oops might otherwise be observed:
FS-Cache: Assertion failed
BUG: failure at fs/fscache/operation.c:270/fscache_start_operations()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 10 Comm: kworker/u2:1 Not tainted 3.10.0-rc1-00133-ge7ebb75 #2
Workqueue: fscache_operation fscache_op_work_func [fscache]
7f091c48 603c8947 7f090000 7f9b1361 7f25f080 00000001 7f26d440 7f091c90
60299eb8 7f091d90 602951c5 7f26d440 3000000008 7f091da0 7f091cc0 7f091cd0
00000007 00000007 00000006 7f091ae0 00000010 0000010e 7f9af330 7f091ae0
Call Trace:
7f091c88: [<60299eb8>] dump_stack+0x17/0x19
7f091c98: [<602951c5>] panic+0xf4/0x1e9
7f091d38: [<6002b10e>] set_signals+0x1e/0x40
7f091d58: [<6005b89e>] __wake_up+0x4e/0x70
7f091d98: [<7f9aa003>] fscache_start_operations+0x43/0x50 [fscache]
7f091da8: [<7f9aa1e3>] fscache_op_complete+0x1d3/0x220 [fscache]
7f091db8: [<60082985>] unlock_page+0x55/0x60
7f091de8: [<7fb25bb0>] cachefiles_read_copier+0x250/0x330 [cachefiles]
7f091e58: [<7f9ab03c>] fscache_op_work_func+0xac/0x120 [fscache]
7f091e88: [<6004d5b0>] process_one_work+0x250/0x3a0
7f091ef8: [<6004edc7>] worker_thread+0x177/0x2a0
7f091f38: [<6004ec50>] worker_thread+0x0/0x2a0
7f091f58: [<60054418>] kthread+0xd8/0xe0
7f091f68: [<6005bb27>] finish_task_switch.isra.64+0x37/0xa0
7f091fd8: [<600185cf>] new_thread_handler+0x8f/0xb0
Reported-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
struct fscache_retrieval contains a count of the number of pages that still
need some processing (n_pages). This is decremented as the pages are
processed.
However, this needs to be atomic as fscache_retrieval_complete() (I think) just
occasionally may be called from cachefiles_read_backing_file() and
cachefiles_read_copier() simultaneously.
This happens when an fscache_read_or_alloc_pages() request containing a lot of
pages (say a couple of hundred) is being processed. The read on each backing
page is dispatched individually because we need to insert a monitor into the
waitqueue to catch when the read completes. However, under low-memory
conditions, we might be forced to wait in the allocator - and this gives the
I/O on the backing page a chance to complete first.
When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto
the workqueue without waiting for the operation to finish the initial I/O
dispatch (we want to release any pages we can as soon as we can), thus both can
end up running simultaneously and potentially attempting to partially complete
the retrieval simultaneously (ENOMEM may occur, backing pages may already be in
the page cache).
This was demonstrated by parallelling the non-atomic counter with an atomic
counter and printing both of them when the assertion fails. At this point, the
atomic counter has reached zero, but the non-atomic counter has not.
To fix this, make the counter an atomic_t.
This results in the following bug appearing
FS-Cache: Assertion failed
3 == 5 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:421!
or
FS-Cache: Assertion failed
3 == 5 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:414!
With a backtrace like the following:
RIP: 0010:[<ffffffffa0211b1d>] fscache_put_operation+0x1ad/0x240 [fscache]
Call Trace:
[<ffffffffa0213185>] fscache_retrieval_work+0x55/0x270 [fscache]
[<ffffffffa0213130>] ? fscache_retrieval_work+0x0/0x270 [fscache]
[<ffffffff81090b10>] worker_thread+0x170/0x2a0
[<ffffffff81096d10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff810909a0>] ? worker_thread+0x0/0x2a0
[<ffffffff81096966>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810968d0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Simplify the way fscache cache objects retain their cookie. The way I
implemented the cookie storage handling made synchronisation a pain (ie. the
object state machine can't rely on the cookie actually still being there).
Instead of the object being detached from the cookie and the cookie being
freed in __fscache_relinquish_cookie(), we defer both operations:
(*) The detachment of the object from the list in the cookie now takes place
in fscache_drop_object() and is thus governed by the object state machine
(fscache_detach_from_cookie() has been removed).
(*) The release of the cookie is now in fscache_object_destroy() - which is
called by the cache backend just before it frees the object.
This means that the fscache_cookie struct is now available to the cache all the
way through from ->alloc_object() to ->drop_object() and ->put_object() -
meaning that it's no longer necessary to take object->lock to guarantee access.
However, __fscache_relinquish_cookie() doesn't wait for the object to go all
the way through to destruction before letting the netfs proceed. That would
massively slow down the netfs. Since __fscache_relinquish_cookie() leaves the
cookie around, it must therefore break all attachments to the netfs - which
includes ->def, ->netfs_data and any outstanding page read/writes.
To handle this, struct fscache_cookie now has an n_active counter:
(1) This starts off initialised to 1.
(2) Any time the cache needs to get at the netfs data, it calls
fscache_use_cookie() to increment it - if it is not zero. If it was zero,
then access is not permitted.
(3) When the cache has finished with the data, it calls fscache_unuse_cookie()
to decrement it. This does a wake-up on it if it reaches 0.
(4) __fscache_relinquish_cookie() decrements n_active and then waits for it to
reach 0. The initialisation to 1 in step (1) ensures that we only get
wake ups when we're trying to get rid of the cookie.
This leaves __fscache_relinquish_cookie() a lot simpler.
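A minimal sketch of the counter pattern in (1)-(4) above, using illustrative names and a plain wait queue rather than the exact fscache implementation:
	struct example_cookie {
		atomic_t n_active;		/* starts at 1; 0 means relinquished */
		wait_queue_head_t waitq;
	};
	static bool example_use_cookie(struct example_cookie *cookie)
	{
		/* Fails once relinquishment has dropped the initial reference. */
		return atomic_inc_not_zero(&cookie->n_active) != 0;
	}
	static void example_unuse_cookie(struct example_cookie *cookie)
	{
		if (atomic_dec_and_test(&cookie->n_active))
			wake_up(&cookie->waitq);	/* wake the relinquisher */
	}
	static void example_relinquish_cookie(struct example_cookie *cookie)
	{
		atomic_dec(&cookie->n_active);		/* drop the initial reference from (1) */
		wait_event(cookie->waitq, atomic_read(&cookie->n_active) == 0);
	}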
***
This fixes a problem in the current code whereby if fscache_invalidate() is
followed sufficiently quickly by fscache_relinquish_cookie() then it is
possible for __fscache_relinquish_cookie() to have detached the cookie from the
object and cleared the pointer before a thread is dispatched to process the
invalidation state in the object state machine.
Since the pending write clearance was deferred to the invalidation state to
make it asynchronous, we need to either wait in relinquishment for the stores
tree to be cleared in the invalidation state or we need to handle the clearance
in relinquishment.
Further, if the relinquishment code does clear the tree, then the invalidation
state needs to make the clearance contingent on still having the cookie to hand
(since that's where the tree is rooted) and we have to prevent the cookie from
disappearing for the duration.
This can lead to an oops like the following:
BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
...
RIP: 0010:[<ffffffff8151023e>] _spin_lock+0xe/0x30
...
CR2: 000000000000000c ...
...
Process kslowd002 (...)
....
Call Trace:
[<ffffffffa01c3278>] fscache_invalidate_writes+0x38/0xd0 [fscache]
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff8105e759>] ? find_busiest_queue+0x69/0x150
[<ffffffff8110ddd4>] ? slow_work_enqueue+0x104/0x180
[<ffffffffa01c1303>] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache]
[<ffffffff81096b67>] ? bit_waitqueue+0x17/0xd0
[<ffffffff8110e233>] slow_work_execute+0x233/0x310
[<ffffffff8110e515>] slow_work_thread+0x205/0x360
[<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8110e310>] ? slow_work_thread+0x0/0x360
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810968a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
The parameter to fscache_invalidate_writes() was object->cookie which is NULL.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Fix object state machine to have separate work and wait states as that makes
it easier to envision.
There are now three kinds of state:
(1) Work state. This is an execution state. No event processing is performed
by a work state. The function attached to a work state returns a pointer
indicating the next state to which the OSM should transition. Returning
NO_TRANSIT repeats the current state, but goes back to the scheduler
first.
(2) Wait state. This is an event processing state. No execution is
performed by a wait state. Wait states are just tables of "if event X
occurs, clear it and transition to state Y". The dispatcher returns to
the scheduler if none of the events in which the wait state has an
interest are currently pending.
(3) Out-of-band state. This is a special work state. Transitions to normal
states can be overridden when an unexpected event occurs (eg. I/O error).
Instead the dispatcher disables and clears the OOB event and transits to
the specified work state. This then acts as an ordinary work state,
though object->state points to the overridden destination. Returning
NO_TRANSIT resumes the overridden transition.
In addition, the states have names in their definitions, so there's no need for
tables of state names. Further, the EV_REQUEUE event is no longer necessary as
that is automatic for work states.
Since the states are now separate structs rather than values in an enum, it's
not possible to use comparisons other than (non-)equality between them, so use
some object->flags to indicate what phase an object is in.
The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
(EV_KILL). An object flag now carries the information about retirement.
Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
into a KILL_OBJECT state and additional states have been added for handling
waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).
A state has also been added for synchronising with parent object initialisation
(WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).
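The shape of the new state definitions, sketched with hypothetical names (the real struct differs in detail): a work state carries an execution function that returns the next state, whereas a wait state carries only an event-to-state table.
	struct example_object;
	struct example_state;
	struct example_transition {
		unsigned long events;			/* event bits this entry consumes */
		const struct example_state *to;		/* state to transition to */
	};
	struct example_state {
		const char *name;			/* name lives in the definition itself */
		/* Work function; NULL for a wait state.  Returning NO_TRANSIT repeats the state. */
		const struct example_state *(*work)(struct example_object *object, int event);
		struct example_transition transitions[4];	/* consulted only for wait states */
	};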
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Wrap checks on object state (mostly outside of fs/fscache/object.c) with
inline functions so that the mechanism can be replaced.
Some of the state checks within object.c are left as-is as they will be
replaced.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Uninline fscache_object_init() so as not to expose some of the FS-Cache
internals to the cache backend.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
The spinlock call within the while() condition causes a compile error if it is
not implemented as a function. This is not a problem on mainline, but it does
not look pretty and there is no reason to do it that way.
This patch writes it a little differently and avoids the double condition.
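Purely as a hypothetical illustration of the construct in question (the actual FS-Cache code is not reproduced here), the rewrite turns a comma expression in the loop condition into an ordinary statement:
	/* Before: spin_lock() must be usable as an expression. */
	while (spin_lock(&object->lock), !list_empty(&object->pending)) {
		/* ... */
		spin_unlock(&object->lock);
	}
	/* After: equivalent behaviour - the loop still exits with the lock held. */
	for (;;) {
		spin_lock(&object->lock);
		if (list_empty(&object->pending))
			break;
		/* ... */
		spin_unlock(&object->lock);
	}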
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
There is a kernel memory leak observed when the proc file
/proc/fs/fscache/stats is read.
The reason is that fscache_stats_open() calls single_open(), but the matching
release function is not called during release. Fix this by using the correct
release function - single_release().
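Sketched from the description above (surrounding fields shown for illustration), the fix amounts to pairing single_open() with single_release() in the file_operations:
	static const struct file_operations fscache_stats_fops = {
		.open		= fscache_stats_open,	/* calls single_open() */
		.read		= seq_read,
		.llseek		= seq_lseek,
		.release	= single_release,	/* frees the seq_file that single_open() set up */
	};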
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=57101
Signed-off-by: Anurup m <anurup.m@huawei.com>
Cc: shyju pv <shyju.pv@huawei.com>
Cc: Sanil kumar <sanil.kumar@huawei.com>
Cc: Nataraj m <nataraj.m@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm not sure why, but the hlist for-each-entry iterators were conceived
differently from the list ones, which look like:
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
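Concretely, with a hypothetical struct foo embedding an hlist_node called 'member' (do_something() is a placeholder), the change looks like this:
	struct foo *obj;
	struct hlist_node *pos;		/* only needed by the old form */
	/* Before: */
	hlist_for_each_entry(obj, pos, head, member)
		do_something(obj);
	/* After: it reads just like list_for_each_entry(). */
	hlist_for_each_entry(obj, head, member)
		do_something(obj);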
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark as cancelled an operation that is in progress rather than pending at the
time it is cancelled, and call fscache_op_complete() to cancel an operation so
that blocked ops can be started.
Signed-off-by: David Howells <dhowells@redhat.com>
In fscache_write_op(), if the object is determined to have become inactive or
to have lost its cookie, we don't move the operation state from in-progress,
and so an assertion in fscache_put_operation() fails (see below).
Instrumenting fscache_op_work_func() indicates that it called
fscache_write_op() before calling fscache_put_operation() - where the assertion
failed. The assertion at line 433 indicates that the operation state is
IN_PROGRESS rather than being COMPLETE or CANCELLED.
Instrumenting fscache_write_op() showed that it was being called on an object
that had had its cookie removed and that this was due to relinquishment of the
cookie by the netfs. At this point fscache no longer has access to the pages
of netfs data that were requested to be written, and so simply cancelling the
operation is the thing to do.
FS-Cache: Assertion failed
3 == 5 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:433!
invalid opcode: 0000 [#1] SMP
Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
CPU 0
Pid: 1035, comm: kworker/u:3 Tainted: GF 3.7.0-rc8-fsdevel+ #411 /DG965RY
RIP: 0010:[<ffffffffa007db22>] [<ffffffffa007db22>] fscache_put_operation+0x11a/0x2ed [fscache]
RSP: 0018:ffff88003e32bcf8 EFLAGS: 00010296
RAX: 000000000000000f RBX: ffff88001818eb78 RCX: ffffffff6c102000
RDX: ffffffff8102d1ad RSI: ffffffff6c102000 RDI: ffffffff8102d1d6
RBP: ffff88003e32bd18 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa00811da
R13: 0000000000000001 R14: 0000000100625d26 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff7dd31c68 CR3: 000000003d730000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:3 (pid: 1035, threadinfo ffff88003e32a000, task ffff88003bb38080)
Stack:
ffffffff8102d1ad ffff88001818eb78 ffffffffa00811da 0000000000000001
ffff88003e32bd48 ffffffffa007f0ad ffff88001818eb78 ffffffff819583c0
ffff88003df24e00 ffff88003882c3e0 ffff88003e32bde8 ffffffff81042de0
Call Trace:
[<ffffffff8102d1ad>] ? vprintk_emit+0x3c6/0x41a
[<ffffffffa00811da>] ? __fscache_read_or_alloc_pages+0x4bc/0x4bc [fscache]
[<ffffffffa007f0ad>] fscache_op_work_func+0xec/0x123 [fscache]
[<ffffffff81042de0>] process_one_work+0x21c/0x3b0
[<ffffffff81042d82>] ? process_one_work+0x1be/0x3b0
[<ffffffffa007efc1>] ? fscache_operation_gc+0x23e/0x23e [fscache]
[<ffffffff8104332e>] worker_thread+0x202/0x2df
[<ffffffff8104312c>] ? rescuer_thread+0x18e/0x18e
[<ffffffff81047c1c>] kthread+0xd0/0xd8
[<ffffffff81421bfa>] ? _raw_spin_unlock_irq+0x29/0x3e
[<ffffffff81047b4c>] ? __init_kthread_worker+0x55/0x55
[<ffffffff814227ec>] ret_from_fork+0x7c/0xb0
[<ffffffff81047b4c>] ? __init_kthread_worker+0x55/0x55
Signed-off-by: David Howells <dhowells@redhat.com>
Add a missing transition to the FS-Cache object state machine to handle an
invalidation event occurring between the back end completing the object lookup
by calling fscache_obtained_object() (which moves to state OBJECT_AVAILABLE)
and the backend returning to fscache_lookup_object() and thence to
fscache_object_state_machine() which then does a goto lookup_transit to handle
the transition - but lookup_transit doesn't handle EV_INVALIDATE.
Without this, the following BUG can be logged:
FS-Cache: Unsupported event 2 [5/f7] in state OBJECT_AVAILABLE
------------[ cut here ]------------
kernel BUG at fs/fscache/object.c:357!
Where event 2 is EV_INVALIDATE.
Signed-off-by: David Howells <dhowells@redhat.com>
nfs_migrate_page() does not wait for FS-Cache to finish with a page, probably
leading to the following bad-page-state:
BUG: Bad page state in process python-bin pfn:17d39b
page:ffffea00053649e8 flags:004000000000100c count:0 mapcount:0 mapping:(null)
index:38686 (Tainted: G B ---------------- )
Pid: 31053, comm: python-bin Tainted: G B ----------------
2.6.32-71.24.1.el6.x86_64 #1
Call Trace:
[<ffffffff8111bfe7>] bad_page+0x107/0x160
[<ffffffff8111ee69>] free_hot_cold_page+0x1c9/0x220
[<ffffffff8111ef19>] __pagevec_free+0x59/0xb0
[<ffffffff8104b988>] ? flush_tlb_others_ipi+0x128/0x130
[<ffffffff8112230c>] release_pages+0x21c/0x250
[<ffffffff8115b92a>] ? remove_migration_pte+0x28a/0x2b0
[<ffffffff8115f3f8>] ? mem_cgroup_get_reclaim_stat_from_page+0x18/0x70
[<ffffffff81122687>] ____pagevec_lru_add+0x167/0x180
[<ffffffff811226f8>] __lru_cache_add+0x58/0x70
[<ffffffff81122731>] lru_cache_add_lru+0x21/0x40
[<ffffffff81123f49>] putback_lru_page+0x69/0x100
[<ffffffff8115c0bd>] migrate_pages+0x13d/0x5d0
[<ffffffff81122687>] ? ____pagevec_lru_add+0x167/0x180
[<ffffffff81152ab0>] ? compaction_alloc+0x0/0x370
[<ffffffff8115255c>] compact_zone+0x4cc/0x600
[<ffffffff8111cfac>] ? get_page_from_freelist+0x15c/0x820
[<ffffffff810672f4>] ? check_preempt_wakeup+0x1c4/0x3c0
[<ffffffff8115290e>] compact_zone_order+0x7e/0xb0
[<ffffffff81152a49>] try_to_compact_pages+0x109/0x170
[<ffffffff8111e94d>] __alloc_pages_nodemask+0x5ed/0x850
[<ffffffff814c9136>] ? thread_return+0x4e/0x778
[<ffffffff81150d43>] alloc_pages_vma+0x93/0x150
[<ffffffff81167ea5>] do_huge_pmd_anonymous_page+0x135/0x340
[<ffffffff814cb6f6>] ? rwsem_down_read_failed+0x26/0x30
[<ffffffff81136755>] handle_mm_fault+0x245/0x2b0
[<ffffffff814ce383>] do_page_fault+0x123/0x3a0
[<ffffffff814cbdf5>] page_fault+0x25/0x30
nfs_migrate_page() calls nfs_fscache_release_page() which doesn't actually wait
- even if __GFP_WAIT is set. The reason it doesn't wait is that
fscache_maybe_release_page() might deadlock the allocator as the work threads
writing to the cache may all end up sleeping on memory allocation.
However, I wonder if that is actually a problem. There are a number of things
I can do to deal with this:
(1) Make nfs_migrate_page() wait.
(2) Make fscache_maybe_release_page() honour the __GFP_WAIT flag.
(3) Set a timeout around the wait.
(4) Make nfs_migrate_page() return an error if the page is still busy.
For the moment, I'll select (2) and (4).
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
The function to submit an exclusive op (fscache_submit_exclusive_op()) can BUG
if there's been an I/O error because it may see the parent cache object in an
unexpected state. It should only BUG if there hasn't been an I/O error.
In this case the problem was produced by remounting the cache partition to be
R/O. The EROFS state was detected and the cache was aborted, but not
everything handled the aborting correctly.
SysRq : Emergency Remount R/O
EXT4-fs (sda6): re-mounted. Opts: (null)
Emergency Remount complete
CacheFiles: I/O Error: Failed to update xattr with error -30
FS-Cache: Cache cachefiles stopped due to I/O error
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:128!
invalid opcode: 0000 [#1] SMP
CPU 0
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc
Pid: 6612, comm: kworker/u:2 Not tainted 3.1.0-rc8-fsdevel+ #1093 /DG965RY
RIP: 0010:[<ffffffffa00739c0>] [<ffffffffa00739c0>] fscache_submit_exclusive_op+0x2ad/0x2c2 [fscache]
RSP: 0018:ffff880000853d40 EFLAGS: 00010206
RAX: ffff880038ac72a8 RBX: ffff8800181f2260 RCX: ffffffff81f2b2b0
RDX: 0000000000000001 RSI: ffffffff8179a478 RDI: ffff8800181f2280
RBP: ffff880000853d60 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff880038ac7268
R13: ffff8800181f2280 R14: ffff88003a359190 R15: 000000010122b162
FS: 0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000034cc4a77f0 CR3: 0000000010e96000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:2 (pid: 6612, threadinfo ffff880000852000, task ffff880014c3c040)
Stack:
ffff8800181f2260 ffff8800181f2310 ffff880038ac7268 ffff8800181f2260
ffff880000853dc0 ffffffffa0072375 ffff880037ecfe00 ffff88003a359198
ffff880000853dc0 0000000000000246 0000000000000000 ffff88000a91d308
Call Trace:
[<ffffffffa0072375>] fscache_object_work_func+0x792/0xe65 [fscache]
[<ffffffff81047e44>] process_one_work+0x1eb/0x37f
[<ffffffff81047de6>] ? process_one_work+0x18d/0x37f
[<ffffffffa0071be3>] ? fscache_enqueue_dependents+0xd8/0xd8 [fscache]
[<ffffffff810482e4>] worker_thread+0x15a/0x21a
[<ffffffff8104818a>] ? rescuer_thread+0x188/0x188
[<ffffffff8104bf96>] kthread+0x7f/0x87
[<ffffffff813ad6f4>] kernel_thread_helper+0x4/0x10
[<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0
[<ffffffff813abd1d>] ? retint_restore_args+0xe/0xe
[<ffffffff8104bf17>] ? __init_kthread_worker+0x53/0x53
[<ffffffff813ad6f0>] ? gs_change+0xb/0xb
Signed-off-by: David Howells <dhowells@redhat.com>
Limit the number of I/O error reports for a cache to 1 to prevent massive
amounts of noise. After the first I/O error the cache is taken off line
automatically, so must be restarted to resume caching.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't mask off the object event mask when printing it. That way it can be seen
if there are bits set that shouldn't be.
Signed-off-by: David Howells <dhowells@redhat.com>
Initialise the object event mask with the calculated mask rather than unmasking
undefined events also.
Signed-off-by: David Howells <dhowells@redhat.com>
Provide a proper invalidation method rather than relying on the netfs retiring
the cookie it has and getting a new one. The problem with this is that it isn't
easy for the netfs to make sure that it has completed/cancelled all its
outstanding storage and retrieval operations on the cookie it is retiring.
Instead, have the cache provide an invalidation method that will cancel or wait
for all currently outstanding operations before invalidating the cache, and
will cause new operations to queue up behind that. Whilst invalidation is in
progress, some requests will be rejected until the cache can stack a barrier on
the operation queue to cause new operations to be deferred behind it.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the state management of internal fscache operations and the accounting of
what operations are in what states.
This is done by:
(1) Give struct fscache_operation an enum variable that directly represents the
state it's currently in, rather than spreading this knowledge over a bunch
of flags, a note of who's processing the operation at the moment and whether
it is queued or not.
This makes it easier to write assertions to check the state at various
points and to prevent invalid state transitions.
(2) Add an 'operation complete' state and supply a function to indicate the
completion of an operation (fscache_op_complete()) and make things call
it. The final call to fscache_put_operation() can then check that an op is
in the appropriate state (complete or cancelled).
(3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
govern the state of an object:
(a) The ->n_ops is now the number of extant operations on the object
and is now decremented by fscache_put_operation() only.
(b) The ->n_in_progress is simply the number of ops that have been
taken off of the object's pending queue for the purposes of being
run. This is decremented by fscache_op_complete() only.
(c) The ->n_exclusive is the number of exclusive ops that have been
submitted and queued or are in progress. It is decremented by
fscache_op_complete() and by fscache_cancel_op().
fscache_put_operation() and fscache_operation_gc() now no longer try to
clean up ->n_exclusive and ->n_in_progress. That was leading to double
decrements against fscache_cancel_op().
fscache_cancel_op() now no longer decrements ->n_ops. That was leading to
double decrements against fscache_put_operation().
fscache_submit_exclusive_op() now decides whether it has to queue an op
based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
will persist in being true even after all preceding operations have been
cancelled or completed. Furthermore, if an object is active and there are
runnable ops against it, there must be at least one op running.
(4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
provide a function to record completion of the pages as they complete.
When n_pages reaches 0, the operation is deemed to be complete and
fscache_op_complete() is called.
Add calls to fscache_retrieval_complete() anywhere we've finished with a
page we've been given to read or allocate for. This includes places where
we just return pages to the netfs for reading from the server and where
accessing the cache fails and we discard the proposed netfs page.
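A sketch of the shape of (1) and (2), using illustrative identifiers rather than the exact fscache definitions:
	enum example_operation_state {
		EXAMPLE_OP_ST_INITIALISED,	/* set up but not yet submitted */
		EXAMPLE_OP_ST_PENDING,		/* on the object's pending queue */
		EXAMPLE_OP_ST_IN_PROGRESS,	/* taken off the queue; counted in ->n_in_progress */
		EXAMPLE_OP_ST_COMPLETE,		/* fscache_op_complete() has been called */
		EXAMPLE_OP_ST_CANCELLED,	/* cancelled before it could run to completion */
	};
	/* The final put can then assert that the op finished properly, e.g.:
	 *	ASSERT(state == EXAMPLE_OP_ST_COMPLETE || state == EXAMPLE_OP_ST_CANCELLED);
	 */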
The bugs in the unfixed state management manifest themselves as oopses like the
following where the operation completion gets out of sync with return of the
cookie by the netfs. This is possible because the cache unlocks and returns
all the netfs pages before recording its completion - which means that there's
nothing to stop the netfs discarding them and returning the cookie.
FS-Cache: Cookie 'NFS.fh' still has outstanding reads
------------[ cut here ]------------
kernel BUG at fs/fscache/cookie.c:519!
invalid opcode: 0000 [#1] SMP
CPU 1
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc
Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090 /DG965RY
RIP: 0010:[<ffffffffa007050a>] [<ffffffffa007050a>] __fscache_relinquish_cookie+0x170/0x343 [fscache]
RSP: 0018:ffff8800368cfb00 EFLAGS: 00010282
RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
FS: 0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
Stack:
ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
Call Trace:
[<ffffffffa00b2c91>] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
[<ffffffffa008f25f>] nfs_clear_inode+0x3c/0x41 [nfs]
[<ffffffffa0090df1>] nfs4_evict_inode+0x2f/0x33 [nfs]
[<ffffffff810d8d47>] evict+0xa1/0x15c
[<ffffffff810d8e2e>] dispose_list+0x2c/0x38
[<ffffffff810d9ebd>] prune_icache_sb+0x28c/0x29b
[<ffffffff810c56b7>] prune_super+0xd5/0x140
[<ffffffff8109b615>] shrink_slab+0x102/0x1ab
[<ffffffff8109d690>] balance_pgdat+0x2f2/0x595
[<ffffffff8103e009>] ? process_timeout+0xb/0xb
[<ffffffff8109dba3>] kswapd+0x270/0x289
[<ffffffff8104c5ea>] ? __init_waitqueue_head+0x46/0x46
[<ffffffff8109d933>] ? balance_pgdat+0x595/0x595
[<ffffffff8104bf7a>] kthread+0x7f/0x87
[<ffffffff813ad6b4>] kernel_thread_helper+0x4/0x10
[<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0
[<ffffffff813abcdd>] ? retint_restore_args+0xe/0xe
[<ffffffff8104befb>] ? __init_kthread_worker+0x53/0x53
[<ffffffff813ad6b0>] ? gs_change+0xb/0xb
Signed-off-by: David Howells <dhowells@redhat.com>
Make fscache_relinquish_cookie() log a warning and wait if there are any
outstanding reads left on the cookie it was given.
Signed-off-by: David Howells <dhowells@redhat.com>
Check that the netfs isn't trying to relinquish a cookie that still has read
operations in progress upon it. If there are, then log a warning and BUG.
Signed-off-by: David Howells <dhowells@redhat.com>
Downgrade the requirements passed to the allocator in the gfp flags parameter.
FS-Cache/CacheFiles can handle OOM conditions simply by aborting the attempt to
store an object or a page in the cache.
Signed-off-by: David Howells <dhowells@redhat.com>
Under some circumstances CacheFiles defers the marking of pages with PG_fscache
so that it can take advantage of pagevecs to reduce the number of calls to
fscache_mark_pages_cached() and the netfs's hook to keep track of this.
There are, however, two problems with this:
(1) It can lead to the PG_fscache mark being applied _after_ the page is set
PG_uptodate and unlocked (by the call to fscache_end_io()).
(2) CacheFiles's ref on the page is dropped immediately following
fscache_end_io() - and so may not still be held when the mark is applied.
This can lead to the page being passed back to the allocator before the
mark is applied.
Fix this by, where appropriate, marking the page before calling
fscache_end_io() and releasing the page. This means that we can't take
advantage of pagevecs and have to make a separate call for each page to the
marking routines.
The symptoms of this are Bad Page state errors cropping up under memory
pressure, for example:
BUG: Bad page state in process tar pfn:002da
page:ffffea0000009fb0 count:0 mapcount:0 mapping: (null) index:0x1447
page flags: 0x1000(private_2)
Pid: 4574, comm: tar Tainted: G W 3.1.0-rc4-fsdevel+ #1064
Call Trace:
[<ffffffff8109583c>] ? dump_page+0xb9/0xbe
[<ffffffff81095916>] bad_page+0xd5/0xea
[<ffffffff81095d82>] get_page_from_freelist+0x35b/0x46a
[<ffffffff810961f3>] __alloc_pages_nodemask+0x362/0x662
[<ffffffff810989da>] __do_page_cache_readahead+0x13a/0x267
[<ffffffff81098942>] ? __do_page_cache_readahead+0xa2/0x267
[<ffffffff81098d7b>] ra_submit+0x1c/0x20
[<ffffffff8109900a>] ondemand_readahead+0x28b/0x29a
[<ffffffff81098ee2>] ? ondemand_readahead+0x163/0x29a
[<ffffffff810990ce>] page_cache_sync_readahead+0x38/0x3a
[<ffffffff81091d8a>] generic_file_aio_read+0x2ab/0x67e
[<ffffffffa008cfbe>] nfs_file_read+0xa4/0xc9 [nfs]
[<ffffffff810c22c4>] do_sync_read+0xba/0xfa
[<ffffffff81177a47>] ? security_file_permission+0x7b/0x84
[<ffffffff810c25dd>] ? rw_verify_area+0xab/0xc8
[<ffffffff810c29a4>] vfs_read+0xaa/0x13a
[<ffffffff810c2a79>] sys_read+0x45/0x6c
[<ffffffff813ac37b>] system_call_fastpath+0x16/0x1b
As can be seen, PG_private_2 (== PG_fscache) is set in the page flags.
Instrumenting fscache_mark_pages_cached() to verify whether page->mapping was
set appropriately showed that sometimes it wasn't. This led to the discovery
that sometimes the page has apparently been reclaimed by the time the marker
got to see it.
Reported-by: M. Stevens <m@tippett.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
The compiler, at least for ix86 and m68k, validly warns that the
comparison:
next <= (loff_t)-1
is always true (and it's always true also for x86-64 and probably all
other arches - as long as pgoff_t isn't wider than loff_t). The
intention appears to be to avoid wrapping of "next", so rather than
eliminating the pointless comparison, fix the loop to indeed get exited
when "next" would otherwise wrap.
On m68k the following warning is observed:
fs/fscache/page.c: In function '__fscache_uncache_all_inode_pages':
fs/fscache/page.c:979: warning: comparison is always false due to limited range of data type
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Suresh Jayaraman <sjayaraman@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an FS-Cache helper to bulk uncache pages on an inode. This will
only work for the circumstance where the pages in the cache correspond
1:1 with the pages attached to an inode's page cache.
This is required for CIFS and NFS: when disabling an inode cookie, we were
returning the cookie and setting cifsi->fscache to NULL but failed to
invalidate any previously mapped pages. This resulted in "Bad page
state" errors and manifested in other kind of errors when running
fsstress. Fix it by uncaching mapped pages when we disable the inode
cookie.
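An illustrative sketch of the call site when the inode cookie is disabled (the surrounding netfs code is hypothetical and abbreviated):
	/* Uncache every PG_fscache page attached to the inode before dropping the cookie. */
	if (inode->i_mapping && inode->i_mapping->nrpages)
		fscache_uncache_all_inode_pages(cifsi->fscache, inode);
	fscache_relinquish_cookie(cifsi->fscache, 0);
	cifsi->fscache = NULL;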
This patch should fix the following oops and "Bad page state" errors
seen during fsstress testing.
------------[ cut here ]------------
kernel BUG at fs/cachefiles/namei.c:201!
invalid opcode: 0000 [#1] SMP
Pid: 5, comm: kworker/u:0 Not tainted 2.6.38.7-30.fc15.x86_64 #1 Bochs Bochs
RIP: 0010: cachefiles_walk_to_object+0x436/0x745 [cachefiles]
RSP: 0018:ffff88002ce6dd00 EFLAGS: 00010282
RAX: ffff88002ef165f0 RBX: ffff88001811f500 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000282
RBP: ffff88002ce6dda0 R08: 0000000000000100 R09: ffffffff81b3a300
R10: 0000ffff00066c0a R11: 0000000000000003 R12: ffff88002ae54840
R13: ffff88002ae54840 R14: ffff880029c29c00 R15: ffff88001811f4b0
FS: 00007f394dd32720(0000) GS:ffff88002ef00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffcb62ddf8 CR3: 000000001825f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:0 (pid: 5, threadinfo ffff88002ce6c000, task ffff88002ce55cc0)
Stack:
0000000000000246 ffff88002ce55cc0 ffff88002ce6dd58 ffff88001815dc00
ffff8800185246c0 ffff88001811f618 ffff880029c29d18 ffff88001811f380
ffff88002ce6dd50 ffffffff814757e4 ffff88002ce6dda0 ffffffff8106ac56
Call Trace:
cachefiles_lookup_object+0x78/0xd4 [cachefiles]
fscache_lookup_object+0x131/0x16d [fscache]
fscache_object_work_func+0x1bc/0x669 [fscache]
process_one_work+0x186/0x298
worker_thread+0xda/0x15d
kthread+0x84/0x8c
kernel_thread_helper+0x4/0x10
RIP cachefiles_walk_to_object+0x436/0x745 [cachefiles]
---[ end trace 1d481c9af1804caa ]---
I tested the uncaching by the following means:
(1) Create a big file on my NFS server (104857600 bytes).
(2) Read the file into the cache with md5sum on the NFS client. Look in
/proc/fs/fscache/stats:
Pages : mrk=25601 unc=0
(3) Open the file for read/write ("bash 5<>/warthog/bigfile"). Look in proc
again:
Pages : mrk=25601 unc=25601
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no CONFIG_WORKQUEUE_DEBUGFS any more, so this code is dead.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fscache_submit_exclusive_op() adds an operation to the pending list if
other operations are pending. Fix the check for pending ops: n_ops
must be greater than 0 at the point it is checked because it is incremented
immediately beforehand under the same lock.
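Illustratively (this shows the reasoning rather than the literal diff): the op being submitted has already bumped n_ops, so "other ops pending" corresponds to n_ops being greater than 1, not greater than 0.
	object->n_ops++;			/* count the op being submitted now */
	if (object->n_ops > 1) {		/* any *other* ops? then queue behind them */
		/* ... add this op to the object's pending list ... */
	}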
Signed-off-by: Akshat Aranya <aranya@nec-labs.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a dummy printk function for the maintenance of unused printks through gcc
format checking, and also so that side-effect checking is maintained too.
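The usual shape of such a helper is sketched below (illustrative; the exact definition added may differ): the printf attribute preserves gcc's format checking, and because it is a real inline function the arguments are still evaluated, so side effects are preserved too.
	static inline __attribute__((format(printf, 1, 2)))
	int no_printk(const char *fmt, ...)
	{
		return 0;
	}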
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8b8edefa (fscache: convert object to use workqueue instead of
slow-work) made fscache_exit() call unregister_sysctl_table()
unconditionally, breaking the build when sysctl is disabled. Fix it by
putting it inside CONFIG_SYSCTL.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Make fscache operations use only a workqueue instead of a combination of
a workqueue and slow-work. FSCACHE_OP_SLOW is dropped and
FSCACHE_OP_FAST is renamed to FSCACHE_OP_ASYNC and uses newly added
fscache_op_wq workqueue to execute op->processor().
fscache_operation_init_slow() is dropped and fscache_operation_init()
now takes @processor argument directly.
* Unbound workqueue is used.
* fscache_retrieval_work() is no longer necessary as OP_ASYNC now does
the equivalent thing.
* sysctl fscache.operation_max_active added to control concurrency.
The default value is nr_cpus clamped between 2 and
WQ_UNBOUND_MAX_ACTIVE.
* debugfs support is dropped for now. Tracing API based debug
facility is planned to be added.
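A sketch of the kind of setup being described (identifiers and error handling illustrative):
	fscache_op_max_active = clamp_val(num_possible_cpus(), 2, WQ_UNBOUND_MAX_ACTIVE);
	fscache_op_wq = alloc_workqueue("fscache_operation", WQ_UNBOUND,
					fscache_op_max_active);
	if (!fscache_op_wq)
		goto error_op_wq;	/* hypothetical error label */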
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Make fscache object state transition callbacks use workqueue instead
of slow-work. New dedicated unbound CPU workqueue fscache_object_wq
is created. get/put callbacks are renamed and modified to take
@object and called directly from the enqueue wrapper and the work
function. While at it, make all open-coded instances of get/put
use fscache_get/put_object().
* Unbound workqueue is used.
* work_busy() output is printed instead of slow-work flags in object
debugging outputs. They mean basically the same thing bit-for-bit.
* sysctl fscache.object_max_active added to control concurrency. The
default value is nr_cpus clamped between 4 and
WQ_UNBOUND_MAX_ACTIVE.
* slow_work_sleep_till_thread_needed() is replaced with fscache
private implementation fscache_object_sleep_till_congested() which
waits on fscache_object_wq congestion.
* debugfs support is dropped for now. Tracing API based debug
facility is planned to be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
fscache_write_op() makes unnecessary checks of the page variable to see if it
is NULL. It can't be NULL at those points as the kernel would already have
crashed a little higher up where we examined page->index.
Furthermore, unless radix_tree_gang_lookup_tag() can return 1 but no page, a
NULL pointer crash should not be encountered there as we can only get there if
r_t_g_l_t() returned 1.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/fscache/object-list.c: In function 'fscache_objlist_lookup':
fs/fscache/object-list.c:105: warning: cast to pointer from integer of different size
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Order the debugfs statistics correctly. The values displayed through a
seq_printf() statement should be in the same order as the names in the
format string.
In the 'Lookups' line, objects created ('crt=') and lookups timed out
('tmo=') have their values transposed.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all
instances. Change the remaining instances. This makes the debugfs file
display the time mark and the owner's description again.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse complained about this missing spin_unlock()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the EXPERIMENTAL flag from FS-Cache so that Ubuntu can make use of the
facility.
Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton's compiler sees the following warning in FS-Cache:
fs/fscache/object-list.c: In function 'fscache_objlist_lookup':
fs/fscache/object-list.c:94: warning: 'obj' may be used uninitialized in this function
which my compiler doesn't. This is a false positive as obj can only be
used in the comparison against minobj if minobj has been set to something
other than NULL, but for that to happen, obj has to be first set to
something.
Deal with this by preclearing obj too.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a nop fscache_stat_d() macro if CONFIG_FSCACHE_STATS=n, lest errors like
the following occur:
fs/fscache/cache.c: In function 'fscache_withdraw_cache':
fs/fscache/cache.c:386: error: implicit declaration of function 'fscache_stat_d'
fs/fscache/cache.c:386: error: 'fscache_n_cop_sync_cache' undeclared (first use in this function)
fs/fscache/cache.c:386: error: (Each undeclared identifier is reported only once
fs/fscache/cache.c:386: error: for each function it appears in.)
fs/fscache/cache.c:392: error: 'fscache_n_cop_dissociate_pages' undeclared (first use in this function)
Signed-off-by: David Howells <dhowells@redhat.com>
Catch an overly long wait for an old, dying active object when we want to
replace it with a new one. The probability is that all the slow-work threads
are hogged, and the delete can't get a look in.
What we do instead is:
(1) if there's nothing in the slow work queue, we sleep until either the dying
object has finished dying or there is something in the slow work queue
behind which we can queue our object.
(2) if there is something in the slow work queue, we return ETIMEDOUT to
fscache_lookup_object(), which then puts us back on the slow work queue,
presumably behind the deletion that we're blocked by. We are then
deferred for a while until we work our way back through the queue -
without blocking a slow-work thread unnecessarily.
A backtrace similar to the following may appear in the log without this patch:
INFO: task kslowd004:5711 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kslowd004 D 0000000000000000 0 5711 2 0x00000080
ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
Call Trace:
[<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffffa011c4e1>] cachefiles_wait_bit+0x9/0xd [cachefiles]
[<ffffffff81353153>] __wait_on_bit+0x43/0x76
[<ffffffff8111ae39>] ? ext3_xattr_get+0x1ec/0x270
[<ffffffff813531ef>] out_of_line_wait_on_bit+0x69/0x74
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffff8104c125>] ? wake_bit_function+0x0/0x2e
[<ffffffffa011bc79>] cachefiles_mark_object_active+0x203/0x23b [cachefiles]
[<ffffffffa011c209>] cachefiles_walk_to_object+0x558/0x827 [cachefiles]
[<ffffffffa011a429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
[<ffffffffa00aa1e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
[<ffffffffa00aafc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
[<ffffffffa00ab4ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
1 lock held by kslowd004/5711:
#0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffffa011be64>] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]
Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache objects have an FSCACHE_OBJECT_EV_REQUEUE event that can theoretically
be raised to ask the state machine to requeue the object for further processing
before the work function returns to the slow-work facility.
However, fscache_object_work_execute() was clearing that bit before checking
the event mask to see whether the object has any pending events that require it
to be requeued immediately.
Instead, the bit should be cleared after the check and enqueue.
Signed-off-by: David Howells <dhowells@redhat.com>
Start processing an object's operations when that object moves into the DYING
state as the object cannot be destroyed until all its outstanding operations
have completed.
Furthermore, make sure that read and allocation operations handle being woken
up on a dead object. Such events are recorded in the Allocs.abt and
Retrvls.abt statistics as viewable through /proc/fs/fscache/stats.
The code for waiting for object activation for the read and allocation
operations is also extracted into its own function as it is much the same in
all cases, differing only in the stats incremented.
Signed-off-by: David Howells <dhowells@redhat.com>
We must make sure that FSCACHE_COOKIE_LOOKING_UP is cleared on lookup failure
(if an object reaches the LC_DYING state), and we should clear it before
clearing FSCACHE_COOKIE_CREATING.
If this doesn't happen then fscache_wait_for_deferred_lookup() may hold
allocation and retrieval operations indefinitely until they're interrupted by
signals - which in turn pins the dying object until they go away.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a stat counter to count retirement events rather than ordinary release
events (the retire argument to fscache_relinquish_cookie()).
Signed-off-by: David Howells <dhowells@redhat.com>
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are waiting for write to the cache. Under these
conditions, vmscan calls the releasepage() function of the netfs, asking if a
page can be discarded.
The problem is typified by the following trace of a stuck process:
kslowd005 D 0000000000000000 0 4253 2 0x00000080
ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
Call Trace:
[<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
[<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
[<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
[<ffffffff810885d3>] try_to_release_page+0x32/0x3b
[<ffffffff81093203>] shrink_page_list+0x316/0x4ac
[<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
[<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
[<ffffffff81093aa2>] shrink_list+0x8d/0x8f
[<ffffffff81093d1c>] shrink_zone+0x278/0x33c
[<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
[<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
[<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
[<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
[<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
[<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
[<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
[<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
[<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
[<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
[<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
[<ffffffff810b2e82>] do_sync_write+0xe3/0x120
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
[<ffffffff810b1a76>] ? dentry_open+0x82/0x89
[<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
[<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
[<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
In the above backtrace, the following is happening:
(1) A page storage operation is being executed by a slow-work thread
(fscache_write_op()).
(2) FS-Cache farms the operation out to the cache to perform
(cachefiles_write_page()).
(3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
standard write (do_sync_write()) under KERNEL_DS directly from the netfs
page.
(4) However, for Ext3 to perform the write, it must allocate some memory, in
particular, it must allocate at least one page cache page into which it
can copy the data from the netfs page.
(5) Under OOM conditions, the memory allocator can't immediately come up with
a page, so it uses vmscan to find something to discard
(try_to_free_pages()).
(6) vmscan finds a clean netfs page it might be able to discard (possibly the
one it's trying to write out).
(7) The netfs is called to throw the page away (nfs_release_page()) - but it's
called with __GFP_WAIT, so the netfs decides to wait for the store to
complete (__fscache_wait_on_page_write()).
(8) This blocks a slow-work processing thread - possibly against itself.
The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.
To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed. This means that some data won't make it into the
cache this time. To support this, a new FS-Cache function,
fscache_maybe_release_page(), is added to replace what the netfs releasepage()
functions used to do with respect to the cache.
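An illustrative netfs-side sketch (the cookie-lookup helper is hypothetical; the real NFS code differs in detail):
	static int example_releasepage(struct page *page, gfp_t gfp)
	{
		if (PageFsCache(page)) {
			struct fscache_cookie *cookie = example_page_cookie(page);
			/* Returns false if the cache still needs the page and we may not discard it yet. */
			if (!fscache_maybe_release_page(cookie, page, gfp))
				return 0;
		}
		return 1;	/* the page may be released */
	}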
The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan". There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.
What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages. If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.
Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache doesn't correctly handle the netfs requesting a read from the cache
on an object that failed or was withdrawn by the cache. A trace similar to
the following might be seen:
CacheFiles: Lookup failed error -105
[exe ] unexpected submission OP165afe [OBJ6cac OBJECT_LC_DYING]
[exe ] objstate=OBJECT_LC_DYING [OBJECT_LC_DYING]
[exe ] objflags=0
[exe ] objevent=9 [fffffffffffffffb]
[exe ] ops=0 inp=0 exc=0
Pid: 6970, comm: exe Not tainted 2.6.32-rc6-cachefs #50
Call Trace:
[<ffffffffa0076477>] fscache_submit_op+0x3ff/0x45a [fscache]
[<ffffffffa0077997>] __fscache_read_or_alloc_pages+0x187/0x3c4 [fscache]
[<ffffffffa00b6480>] ? nfs_readpage_from_fscache_complete+0x0/0x66 [nfs]
[<ffffffffa00b6388>] __nfs_readpages_from_fscache+0x7e/0x176 [nfs]
[<ffffffff8108e483>] ? __alloc_pages_nodemask+0x11c/0x5cf
[<ffffffffa009d796>] nfs_readpages+0x114/0x1d7 [nfs]
[<ffffffff81090314>] __do_page_cache_readahead+0x15f/0x1ec
[<ffffffff81090228>] ? __do_page_cache_readahead+0x73/0x1ec
[<ffffffff810903bd>] ra_submit+0x1c/0x20
[<ffffffff810906bb>] ondemand_readahead+0x227/0x23a
[<ffffffff81090762>] page_cache_sync_readahead+0x17/0x19
[<ffffffff8108a99e>] generic_file_aio_read+0x236/0x5a0
[<ffffffffa00937bd>] nfs_file_read+0xe4/0xf3 [nfs]
[<ffffffff810b2fa2>] do_sync_read+0xe3/0x120
[<ffffffff81354cc3>] ? _spin_unlock_irq+0x2b/0x31
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff811848e5>] ? selinux_file_permission+0x5d/0x10f
[<ffffffff81352bdb>] ? thread_return+0x3e/0x101
[<ffffffff8117d7b0>] ? security_file_permission+0x11/0x13
[<ffffffff810b3b06>] vfs_read+0xaa/0x16f
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff810b3c84>] sys_read+0x45/0x6c
[<ffffffff8100ae2b>] system_call_fastpath+0x16/0x1b
The object state might also be OBJECT_DYING or OBJECT_WITHDRAWING.
This should be handled by simply rejecting the new operation with ENOBUFS.
There's no need to log an error for it. Events of this type now appear in the
stats file under Ops:rej.
Signed-off-by: David Howells <dhowells@redhat.com>
Don't delete pending pages from the page-store tracking tree, but rather send
them for another write as they've presumably been updated.
Signed-off-by: David Howells <dhowells@redhat.com>