local cifs functions (repost)
Using kernel crypto APIs for DES encryption during LM and NT hash generation
instead of local functions within cifs.
Source file smbdes.c is deleted sans four functions, one of which
uses ecb des functionality provided by kernel crypto APIs.
Remove function SMBOWFencrypt.
Add return codes to various functions such as calc_lanman_hash,
SMBencrypt, and SMBNTencrypt. Includes fix noticed by Dan Carpenter.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Dan Carpenter <error27@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Remove config flag CIFS_EXPERIMENTAL.
Do export operations under new config flag CIFS_NFSD_EXPORT
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
SMB2 is the followon to the CIFS (and SMB) protocols
and the default for Windows since Windows Vista, and also
now implemented by various non-Windows servers. SMB2
is more secure, has various performance advantages, including
larger i/o sizes, flow control, better caching model and more.
SMB2 also resolves some scalability limits in the cifs
protocol and adds many new features while being much
simpler (only a few dozen commands instead of hundreds)
and since the protocol is clearer it is
also more consistently implemented across servers
and thus easier to optimize.
After much discussion with Jeff Layton, Jeremy Allison
and others at Connectathon, we decided to move the smb2
code from a distinct .ko and fstype into distinct
C files that optionally build in cifs.ko. As a result
the Kconfig gets simpler.
To avoid destabilizing cifs, the smb2 code is going
to be moved into its own experimental CONFIG_CIFS_SMB2 ifdef
as it is merged and rereviewed. The changes to stable
cifs (builds with the smb2 ifdef off) are expected to be
fairly small.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We were reserving MAX_USERNAME (now 256) on stack for
something which only needs to fit about 24 bytes ie
string krb50x + printf version of uid
Signed-off-by: Steve French <sfrench@us.ibm.com>
The patch below removes an extra "l" in the word.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Recent Windows versions now create symlinks more frequently
and they do use this "reparse point" symlink mechanism. We can of course
do symlinks nicely to Samba and other servers which support the
CIFS Unix Extensions and we can also do SFU symlinks and "client only"
"MF" symlinks optionally, but for recent Windows we currently can not
handle the common "reparse point" symlinks fully, removing the caller
for this. We will need to extend and reenable this "reparse point" worker
code in cifs and fix cifs_symlink to call this. In the interim this code
has been moved to its own config option so it is not compiled in by default
until cifs_symlink fixed up (and tested) to use this.
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The CIFSSMBNotify worker is unused, pending changes to allow it to be called
via inotify, so move it into its own experimental config option so it does
not get built in, until the necessary VFS support is fixed. It used to
be used in dnotify, but according to Jeff, inotify needs minor changes
before we can reenable this.
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
ino is unused in function cifs_root_iget().
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
No functional changes requires that we eat errors from strtobool.
If people want to not do this, then it should be fixed at a later date.
V2: Simplification suggested by Rusty Russell removes the need for
additional variable ret.
Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
configfs: Fix race between configfs_readdir() and configfs_d_iput()
configfs: Don't try to d_delete() negative dentries.
ocfs2/dlm: Target node death during resource migration leads to thread spin
ocfs2: Skip mount recovery for hard-ro mounts
ocfs2/cluster: Heartbeat mismatch message improved
ocfs2/cluster: Increase the live threshold for global heartbeat
ocfs2/dlm: Use negotiated o2dlm protocol version
ocfs2: skip existing hole when removing the last extent_rec in punching-hole codes.
ocfs2: Initialize data_ac (might be used uninitialized)
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: don't delay blk_run_queue_async
scsi: remove performance regression due to async queue run
blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup
block: rescan partitions on invalidated devices on -ENOMEDIA too
cdrom: always check_disk_change() on open
block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers
configfs_readdir() will use the existing inode numbers of inodes in the
dcache, but it makes them up for attribute files that aren't currently
instantiated. There is a race where a closing attribute file can be
tearing down at the same time as configfs_readdir() is trying to get its
inode number.
We want to get the inode number of open attribute files, because they
should match while instantiated. We can't lock down the transition
where dentry->d_inode is set to NULL, so we just check for NULL there.
We can, however, ensure that an inode we find isn't iput() in
configfs_d_iput() until after we've accessed it.
Signed-off-by: Joel Becker <jlbec@evilplan.org>
When configfs is faking mkdir() on its subsystem or default group
objects, it starts by adding a negative dentry. It then tries to
instantiate the group. If that should fail, it must clean up after
itself.
I was using d_delete() here, but configfs_attach_group() promises to
return an empty dentry on error. d_delete() explodes with the entry
dentry. Let's try d_drop() instead. The unhashing is what we want for
our dentry.
Signed-off-by: Joel Becker <jlbec@evilplan.org>
As Metze pointed out, commit 84cdf74e broke mapchars option:
Commit "cifs: fix unaligned accesses in cifsConvertToUCS"
(84cdf74e80) does multiple steps
in just one commit (moving the function and changing it without
testing).
put_unaligned_le16(temp, &target[j]); is never called for any
codepoint the goes via the 'default' switch statement. As a result
we put just zero (or maybe uninitialized) bytes into the target
buffer.
His proposed patch looks correct, but doesn't apply to the current head
of the tree. This patch should also fix it.
Cc: <stable@kernel.org> # .38.x: 581ade4: cifs: clean up various nits in unicode routines (try #2)
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The is_path_accessible check uses a QPathInfo call, which isn't
supported by ancient win9x era servers. Fall back to an older
SMBQueryInfo call if it fails with the magic error codes.
Cc: stable@kernel.org
Reported-and-Tested-by: Sandro Bonazzola <sandro.bonazzola@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
summarise_journal_usage seems to be obsolete for a long time,
so remove it.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In do_get_write_access() we wait on BH_Unshadow bit for buffer to get
from shadow state. The waking code in journal_commit_transaction() has
a bug because it does not issue a memory barrier after the buffer is moved
from the shadow state and before wake_up_bit() is called. Thus a waitqueue
check can happen before the buffer is actually moved from the shadow state
and waiting process may never be woken. Fix the problem by issuing proper
barrier.
CC: stable@kernel.org
Reported-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When ext2 mounts a filesystem, it attempts to set the block device
blocksize with a call to sb_set_blocksize, which can fail for
several reasons. The current failure message in ext2 prints:
EXT2-fs (loop1): error: blocksize is too small
which is not correct in all cases. This can be demonstrated
by creating a filesystem with
# mkfs.ext2 -b 8192
on a 4k page system, and attempting to mount it.
Change the error message to a more generic:
EXT2-fs (loop1): bad blocksize 8192
to match the error message in ext3.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Reviewed-by: Coly Li <bosong.ly@taobao.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
If an application program does not make any changes to the indirect
blocks or extent tree, i_datasync_tid will not get updated. If there
are enough commits (i.e., 2**31) such that tid_geq()'s calculations
wrap, and there isn't a currently active transaction at the time of
the fdatasync() call, this can end up triggering a BUG_ON in
fs/jbd/commit.c:
J_ASSERT(journal->j_running_transaction != NULL);
It's pretty rare that this can happen, since it requires the use of
fdatasync() plus *very* frequent and excessive use of fsync(). But
with the right workload, it can.
We fix this by replacing the use of tid_geq() with an equality test,
since there's only one valid transaction id that is valid for us to
start: namely, the currently running transaction (if it exists).
CC: stable@kernel.org
Reported-by: Martin_Zielinski@McAfee.com
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
When make_indexed_dir() fails (e.g. because of ENOSPC) after it has allocated
block for index tree root, we did not properly mark all changed buffers dirty.
This lead to only some of these buffers being written out and thus effectively
corrupting the directory.
Fix the issue by marking all changed data dirty even in the error failure case.
CC: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
* use proc_mkdir_mode() instead of create_proc_entry(S_IFDIR|...),
export proc_mkdir_mode() for that, oh well.
* don't supply S_IFREG to proc_create_data(), it's unnecessary
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Currently after mount/remount operation on pstore filesystem,
the content on pstore will be lost. It is because current ERST
implementation doesn't support multi-user usage, which moves
internal pointer to the end after accessing it. Adding
multi-user support for pstore usage.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
the return type of function _read_ in pstore is size_t,
but in the callback function of _read_, the logic doesn't
consider it too much, which means if negative value (assuming
error here) is returned, it will be converted to positive because
of type casting. ssize_t is enough for this function.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch fixes an extremely rare mount failure after a power cut, when mount
fails with ENOSPC error because UBIFS could not find the GC LEB.
In short, the reason for this failure is that after recovery the GC head LEB
contains less free space than it had contained just before the power cut
happened. As a result, if the FS is full, 'ubifs_rcvry_gc_commit()' is unable
to find a dirty LEB to GC and a free LEB, so mount fails.
This patch contains a huge comment with more detailed explanation, please refer
that comment.
Since this is really really rare and unlikely situation, I do not send this
patch to the stable tree, also because it requires a lot of preparation
patches which I did before. So sending this to -stable would be too risky.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Further simplify 'ubifs_recover_leb()' by noticing that we have to call
'clean_buf()' in any case, and it is fine to call it if the offset is
aligned to 'c->min_io_size'. Thus, we do not have to call it separately
from every "if" - just call it once at the end.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Now when we call 'ubifs_recover_leb()' only for LEBs which are potentially
corrupted (i.e., only for last buds, not for all of them), we can cleanup every
LEB, not only those where we find corruption. The reason - unstable bits. Even
though the LEB may look good now, it might contain unstable bits which may hit
us a bit later.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch cleans up 'ubifs_recover_leb()' function and makes it more readable.
Move things which are done only once out of the loop and kill unneeded 'switch'
statement.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
If a UBIFS filesystem is being mounted read-write, or is being remounted
from read-only to read-write, check for the "space_fixup" flag and fix
all LEBs containing empty space if necessary.
Artem: tweaked the patch a bit
Signed-off-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch adds the 'ubifs_fixup_free_space()' function which scans all
LEBs in the filesystem for those that are in-use but have one or more
empty pages, then re-maps the LEBs in order to erase the empty portions.
Afterward it removes the "space_fixup" flag from the UBIFS superblock.
Artem: massaged the patch
Signed-off-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The 'space_fixup' flag can be set in the superblock of a new filesystem by
mkfs.ubifs to indicate that any eraseblocks with free space remaining should be
fixed-up the first time it's mounted (after which the flag is un-set). This
means that the UBIFS image has been flashed by a "dumb" flasher and the free
space has been actually programmed (writing all 0xFFs), so this free space
cannot be used. UBIFS fixes the free space up by re-writing the contents of all
LEBs with free space using the atomic LEB change UBI operation.
Artem: improved commit message, add some more commentaries to the code.
Signed-off-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
We'll need to use the 'next_log_lnum()' helper function from log.c in the fixup
code, so let's move it to misc.h. IOW, this is a preparation to the following
free space fixup changes.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch improves UBIFS recovery and teaches it to expect corruption only
in the last buds. Indeed, currently we just recover all buds, which is
incorrect because only the last buds can have corruptions in case of a power
cut. So it is inconsistent with the rest of the recovery strategy which tries
hard to distinguish between corruptions cause by power cuts and other types of
corruptions.
This patch also adds one quirk - a bit older UBIFS was could have corruption in
the next to last bud because of the way it switched buds: when bud A is full,
it first searched for the next bud B, the wrote a reference node to the log
about B, and then synchronized the write-buffer of A. So we could end up with
buds A and B, where B is the last, but A had corruption. The UBIFS behavior
was fixed, though, so currently it always first synchronizes A's write-buffer
and only after this adds B to the log. However, to be make sure that we handle
unclean (after a power cut) UBIFS images belonging to older UBIFS - we need to
add a quirk and keep it for some time: we need to check for the situation
described above.
Thankfully, it is easy to check for that situation. When UBIFS adds B to the
log, it always first unmaps B, then maps it, and then syncs A's write-buffer.
Thus, in that situation we can check that B is empty, in which case it is OK to
have corruption in A. To check that B is empty it is enough to just read the
first few bytes of the bud and compare them with 0xFFs. This quirk may be
removed in a couple of years.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Currently when UBIFS fills up the current bud (which is the last in the journal
head) and switches to the next bud, it first writes the log reference node for
the next bud and only after this synchronizes the write-buffer of the previous
bud. This is not a big deal, but an unclean power cut may lead to a situation
when we have corruption in a next-to-last bud, although it is much more logical
that we have to have corruption only in the last bud.
This patch also removes write-buffer synchronization from
'ubifs_wbuf_seek_nolock()' because this is not needed anymore (we synchronize
the write-buffer explicitly everywhere now) and also because this is just
prone to various errors.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Remove a 'BUG()' statement when we are unable to find a bud and add a
similar 'ubifs_assert()' statement instead.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This is a minor preparation patch which changes 'replay_bud()' interface -
instead of passing bud lnum, offs, jhead, etc directly, pass a pointer to the
bud entry which contains all the information. The bud entry will be also needed
in one of the following patches.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch simplifies replay even further - it removes the replay tree and
adds the replay list instead. Indeed, we just do not need to use a tree here -
all we need to do is to add all nodes to the list and then sort it. Using
RB-tree is an overkill - more code and slower. And since we replay buds in
order, we expect the nodes to follow in _mostly_ sorted order, so the merge
sort becomes much cheaper in average than an RB-tree.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch simplifies the replay code and makes it smaller. First of all, we
can notice that we do not really need to create bud replay entries and insert
them to the replay tree, because the only reason we do this is to set buds
lprops correctly at the end. Instead, we can just walk the list of buds at the
very end and set lprops for each bud. This allows us to get rid of whole
'insert_ref_node()' function, the 'REPLAY_REF' flag, and several fields in
'struct replay_entry'. Then we can also notice that we do not need the 'flags'
'struct replay_entry' field, because there is only one flag -
'REPLAY_DELETION'. Instead, we can just add a 'deletion' bit fields. As a
result, this patch deletes much more lines that in adds.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This is just a small preparation patch which adds 'free' and 'drity' fields to
'struct bud_entry'. They will be used to set bud lprops.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This is patch removes an unnecessary 'offs' variable from 'ubifs_wbuf_write_nolock()'
- we can just keep 'wbuf->offs' up-to-date instead. This patch is very minor
the only motivation for it was that it is cleaner to keep wbuf->offs up-to-date
by the time we call 'ubifs_leb_write()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Commit 52c6e6f990 provides misleading infomation
in the commit messages - buds are replied in order. And the real reason why
that fix helped is probably because it made sure we seek head even in read-only
mode (so deferred recovery will have seeked heads).
This patch adds an assertion which will fire if we reply buds out of order.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This is a minor change which makes 2 functions static because they
are not used outside the gc.c file: 'data_nodes_cmp()' and
'nondata_nodes_cmp()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Now we return all errors from 'scan_check_cb()' directly, so we do not need
'struct scan_check_data' any more, and this patch removes it.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Simplify error path in 'scan_check_cb()' and stop using the special 'data->err'
field, but instead return the error code directly.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When doing the lprops extra check ('dbg_check_lprops()') we scan whole media.
We even scan empty and freeable LEBs which may contain garbage, which we handle
after scanning. This patch teach the lprops checking function
('scan_check_cb()') to avoid scanning for free and freeable LEBs and save time.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Steps to reproduce the bug:
- Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL
- Call FS_IOC_SETFLAGS ioctl with flags=0
- Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set!
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
COW in btrfs, but FS_NOCOW_FL is sufficient.
The fact is we don't have corresponding BTRFS_INODE_COW flag.
COW is default, and FS_NOCOW_FL can be used to switch off COW for
a single file.
If we mount btrfs with nodatacow, a newly created file will be set with
the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
FS_NOCOW_FL flag.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a btrfs disk is created by mixed data & metadata option, it will have no
pure data or pure metadata space info.
In btrfs's for-linus branch, commit 78b1ea13838039cd88afdd62519b40b344d6c920
(Btrfs: fix OOPS of empty filesystem after balance) initializes space infos at
the very beginning. The problem is this initialization does not take the mixed
case into account, which will cause btrfs will easily get into ENOSPC in mixed
case.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If posix_acl_from_xattr() returns an error code, a negative address is
dereferenced causing an oops; fix by checking for error code first.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows drivers who call this function to be compiled modularly.
Otherwise, a driver who is interested in this type of functionality
has to implement their own get_task_comm() call, causing code
duplication in the Linux source tree.
Signed-off-by: J Freyensee <james_p_freyensee@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following
warning:
In file included from arch/x86/include/asm/uaccess.h:573,
from include/linux/uaccess.h:5,
from include/linux/highmem.h:7,
from include/linux/pagemap.h:10,
from fs/debugfs/file.c:18:
In function 'copy_from_user',
inlined from 'write_file_bool' at fs/debugfs/file.c:435:
arch/x86/include/asm/uaccess_64.h:65: warning: call to
'copy_from_user_overflow' declared with attribute warning:
copy_from_user() buffer size is not provably correct
presumably due to buf_size being signed causing GCC to fail to
see that buf_size can't become negative.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
On some arches (x86, sh, arm, unicore, powerpc) the oops message would
print out the last sysfs file accessed.
This was very useful in finding a number of sysfs and driver core bugs
in the 2.5 and early 2.6 development days, but it has been a number of
years since this file has actually helped in debugging anything that
couldn't also be trivially determined from the stack traceback.
So it's time to delete the line. This is good as we need all the space
we can get for oops messages at times on consoles.
Acked-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFSv4.1: Ensure that layoutget uses the correct gfp modes
NFSv4.1: remove pnfs_layout_hdr from pnfs_destroy_all_layouts tmp_list
NFSv41: Resend on NFS4ERR_RETRY_UNCACHED_REP
It's a hot function, and we're better off not mixing types in the mask
calculations. The compiler just ends up mixing 16-bit and 32-bit
operations, for no good reason.
So do everything in 'unsigned int' rather than mixing 'unsigned int'
masking with a 'umode_t' (16-bit) mode variable.
This, together with the parent commit (47a150edc2: "Cache user_ns in
struct cred") makes acl_permission_check() much nicer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During resource migration, if the target node were to die, the thread doing
the migration spins until the target node is not removed from the domain map.
This patch slows the spin by making the thread wait for the recovery to kick in.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Patch skips mount recovery for hard-ro mounts which otherwise leads to an oops.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
If o2hb finds unexpected values in the heartbeat slot, it prints a message
"ERROR: Device "dm-6": another node is heartbeating in our slot!"
This message could be misleading. This patch adds two more messages to
help users better diagnose the problem.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
We have seen isolated cases (very few, I might add) of o2hb not detecting all
live nodes on startup. One plausible reasoning for it is that other node had
a hb io delay at the same time. The live threshold set at 2 (as low as it can
be) could be increased to ameliorate the situation.
But increasing the threshold directly affects mount time. Currently it takes
around 5 secs to mount a volume in o2cb cluster with local heartbeat. Increasing
the threshold will make mounts even slower. As the issue itself is rare, we have
left things as they are for the local heartbeat mode.
However we can improve the situation for global heartbeat mode as in that mode,
we start the heartbeat much before the mount.
This patch doubles the live threshold for the start of the first region in
global heartbeat mode.
Addresses internal Oracle bug#10635585.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Patch fixes a bug in the o2dlm protocol negotiation in that it is using
the builtin version rather than the negotiated version during the domain
join. This causes join errors when a node having kernel >= 2.6.37 joins
a cluster with nodes having kernels < 2.6.37.
This only affects the o2cb cluster stack.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reported-by: Jacek Stepniewski <Jacek.Stepniewski@agora.pl>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
In the case of removing a partial extent record which covers a hole, current
punching-hole logic will try to remove more than the length of whole extent
record, which leads to the failure of following assert(fs/ocfs2/alloc.c):
5507 BUG_ON(cpos < le32_to_cpu(rec->e_cpos) || trunc_range > rec_range);
This patch tries to skip existing hole at the last attempt of removing a partial
extent record, what's more, it also adds some necessary comments for better
understanding of punching-hole codes.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
CLANG found that there is a path that has data_ac uninitialized,
this place
2917 /* This gets us the dx_root */
2918 ret = ocfs2_reserve_new_metadata_blocks(osb, 1, &meta_ac);
2919 if (ret) {
3
Taking true branch
2920 mlog_errno(ret);
2921 goto out;
4
Control jumps to line 3168
2922 }
Goes to the out: label without data_ac being initialized.
Ciao, Marcus
Signed-Off-By: Marcus Meissner <meissner@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
When re-mounting from R/O mode to R/W mode and the LEB count in the superblock
is not up-to date, because for the underlying UBI volume became larger, we
re-write the superblock. We allocate RAM for these purposes, but never free it.
So this is a memory leak, although very rare one.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
This patch fixes a problem with the following symptoms:
UBIFS: deferred recovery completed
UBIFS error (pid 15676): dbg_check_synced_i_size: ui_size is 11481088, synced_i_size is 11459081, but inode is clean
UBIFS error (pid 15676): dbg_check_synced_i_size: i_ino 128, i_mode 0x81a4, i_size 11481088
It happens when additional debugging checks are enabled and we are recovering
from a power cut. When we fixup corrupted inode size during recovery, we change
them in-place and we change ui_size as well, but not synced_i_size, which
causes this failure. This patch makes sure we change both fields and fixes the
issue.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When the debugging self-checks are enabled, we go trough whole file-system
after mount and check/validate every single node referred to by the index.
This is implemented by the 'dbg_check_filesystem()' function. However, this
function fails if we mount "unclean" file-system, i.e., if we mount the
file-system after a power cut. It fails with the following symptoms:
UBIFS DBG (pid 8171): ubifs_recover_size: ino 937 size 3309925 -> 3317760
UBIFS: recovery deferred
UBIFS error (pid 8171): check_leaf: data node at LEB 1000:0 is not within inode size 3309925
The reason of failure is that recovery fixed up the inode size in memory, but
not on the flash so far. So the value on the flash is incorrect so far,
and would be corrected when we re-mount R/W. But 'check_leaf()' ignores
this fact and tries to validate the size of the on-flash inode, which is
incorrect, so it fails.
This patch teaches the checking code to look at the VFS inode cache first,
and if there is the inode in question, use that inode instead of the inode
on the flash media. This fixes the issue.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
In 'ubifs_recover_size()' we have an "if (!e->inode && c->ro_mount)" statement.
But if 'c->ro_mount' is true, then '!e->inode' must always be true as well. So
we can remove the unnecessary '!e->inode' test and put an
'ubifs_assert(!e->inode)' instead.
This patch also removes an extra trailing white-space in a debugging print,
as well as adds few empty lines to 'ubifs_recover_size()' to make it a bit more
readable.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When recovering the inode size, one of the debugging messages was printed
incorrecly, this patches fixes it.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This commits refactors and cleans up 'ubifs_rcvry_gc_commit()' which was quite
untidy, also removes the commentary which was not 100% correct.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Split the 'ubifs_rcvry_gc_commit()' function and introduce a 'grab_empty_leb()'
heler. This cleans 'ubifs_rcvry_gc_commit()' a little and makes it a bit less
of spagetti.
Also, add a commentary which explains why it is crucial to first search for an
empty LEB and then run commit.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When UBIFS is in the failure mode (used for power cut emulation testing) we for
some reasons do not dump the stack in many places, e.g., in assertions.
Probably at early days we had too many of them and disabled this to make the
development easier, but then never enabled. Nowadays I sometimes observe
assertion failures during power cut testing, but the useful stackdump is not
printed, which is bad. This patch makes UBIFS always print the stackdump when
debugging is enabled.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
If we fail to recover the gc_lnum we just return an error and it then
it is difficult to figure out why this happened. This patch adds useful
debugging information which should make it easier to debug the failure.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch removes a piece of code in 'ubifs_rcvry_gc_commit()' which is never
executed. We call 'ubifs_find_dirty_leb()' function with min_space =
wbuf->offs, so if it returns us an LEB, it is guaranteed to have at lease
'wbuf->offs' bytes of free+dirty space. So we can remove the subsequent code
which deals with "returned LEB has less than 'wbuf->offs' bytes of free+dirty
space". This simplifies 'ubifs_rcvry_gc_commit()' a little.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
We have duplicated code in 'ubifs_garbage_collect()' and
'ubifs_rcvry_gc_commit()', which is about handling the special case of free
LEB. In both cases we just want to garbage-collect the LEB using
'ubifs_garbage_collect_leb()'.
This patch teaches 'ubifs_garbage_collect_leb()' to handle free LEB's so that
the caller does not have to do this and the duplicated code is removed.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Remove the following commentary from 'ubifs_file_mmap()':
/* 'generic_file_mmap()' takes care of NOMMU case */
I do not understand what it means, and I could not find anything relater to
NOMMU in 'generic_file_mmap()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch is a tiny improvement which removes few bytes of code.
UBIFS debugfs files are non-seekable and the file position is ignored,
so do not increase it in the write handler.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The 'dbg_dump_lprop()' is trying to detect journal head LEBs when printing,
so it looks at the write-buffers. However, if we are in R/O mode, we
de-allocate the write-buffers, so 'dbg_dump_lprop()' oopses. This patch fixes
the issue.
Note, this patch is not critical, it is only about the debugging code path, and
it is unlikely that anyone but UBIFS developers would ever hit this issue.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
We have our own flags indicating R/O mode, and c->ro_mode is equivalent
to MS_RDONLY. Let's be consistent and use UBIFS flags everywhere.
This patch is just a minor cleanup.
Additionally, add a comment that we are surprised with VFS behavior -
as a reminder to look at this some day.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
When the debugging failure emulation is enabled and UBIFS decides to
emulate an I/O error, it uses EIO error code. In which case UBIFS
switches into R/O mode later on. The for the user-space is that when
a failure is emulated, the file-system sometimes returns EIO and
sometimes EROFS. This makes it more difficult to implement user-space
tests for the failure mode. Let's be consistent and return EROFS in
all the cases.
This patch is an improvement for the debugging code and does not affect
the functionality at all.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This is just a tiny clean-up patch. The variable name for empty address
space operations is "empty_aops". Let's use consistent names for empty
inode and file operations: "empty_iops" and "empty_fops", instead of
inconsistent "none_inode_operations" and "none_file_operations".
Artem: re-write the commit message.
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Try to improve UBIFS testing coverage by randomly picking LEBs to
store in lsave, rather than picking them optimally. Create a debugging
version of 'populate_lsave()' for these purposes and enable it when
general debugging self-checks are enabled.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS can force itself to use the 'in-the-gaps' commit method - the last resort
method which is normally invoced very very rarely. Currently this "force
int-the-gaps" debugging feature is a separate test mode. But it is a bit saner
to make it to be the "general" self-test check instead.
This patch is just a clean-up which should make the debugging code look a bit
nicer and easier to use - we have way too many debugging options.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch improves the 'dbg_check_space_info()' function which checks
whether the amount of space before re-mounting and after re-mounting
is the same (remounting from R/O to R/W modes and vice-versa).
The problem is that 'dbg_check_space_info()' does not save the budgeting
information before re-mounting, so when an error is reported, we do not
know why the amount of free space changed.
This patches makes the following changes:
1. Teaches 'dbg_dump_budg()' function to accept a 'struct ubifs_budg_info'
argument and print out the this argument. This way we may ask it to
print any saved budgeting info, no only the current one.
2. Accordingly changes all the callers of 'dbg_dump_budg()' to comply with
the changed interface.
3. Introduce a 'saved_bi' (saved budgeting info) field to
'struct ubifs_debug_info' and save the budgeting info before re-mounting
there.
4. Change 'dbg_check_space_info()' and make it print both old and new
budgeting information.
5. Additionally, save 'c->igx_gc_cnt' and print it if and error happens. This
value contributes to the amount of free space, so we have to print it.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Re-arrange the budget dump and make sure we first dump all
the 'struct ubifs_budg_info' fields, and then the other information.
Additionally, print the 'uncommitted_idx' variable.
This change is required for to the following dumping function
enhancement where it will be possible to dump saved
'struct ubifs_budg_info' objects, not only the current one.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The current 'dbg_dump_budg()' calling convention is that the
'c->space_lock' spinlock is held. However, none of the callers
actually use it from contects which have 'c->space_lock' locked,
so all callers have to explicitely lock and unlock the spinlock.
This is not very sensible convention. This patch changes it and
makes 'dbg_dump_budg()' lock the spinlock instead of imposing this
to the callers. This simplifies the code a little.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
This patch separates out all the budgeting-related information
from 'struct ubifs_info' to 'struct ubifs_budg_info'. This way the
code looks a bit cleaner. However, the main driver for this is
that we want to save budgeting information and print it later,
so a separate data structure for this is helpful.
This patch is a preparation for the further debugging output
improvements.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
There was an attempt to standartize various "__attribute__" and
other macros in order to have potentially portable and more
consistent code, see commit 82ddcb0405.
Note, that commit refers Rober Love's blog post, but the URL
is broken, the valid URL is:
http://blog.rlove.org/2005/10/with-little-help-from-your-compiler.html
Moreover, nowadays checkpatch.pl warns about using
__attribute__((packed)):
"WARNING: __packed is preferred over __attribute__((packed))"
It is not a big deal for UBIFS to use __packed, so let's do it.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Fix several minor stylistic issues:
* lines longer than 80 characters
* space before closing parenthesis ')'
* spaces in the indentations
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Turn the debufs files UBIFS maintains into non-seekable. Indeed, none
of them is supposed to be seek'ed.
Do this by making the '.lseek()' handler to be 'no_llseek()' and by
using 'nonseekable_open()' in the '.open()' operation.
This does mean an API break but this debugging API is only used by a couple
of test scripts which do not rely in the 'llseek()' operation.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Now that there are no longer any exceptions to the normal inode
creation code path, we can move the parts of the locking code
which were duplicated in mkdir/mknod/create/symlink into the
inode create function.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This moves the symlink specific parts of inode creation
into the function where we initialise the rest of the
dinode. As a result we have one less place where we need
to look up the inode's buffer.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This moves the initialisation of the directory into the inode
creation functions to avoid having to duplicate the lookup
of the inode's buffer.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Currently, writebacks may end up recursing back into the filesystem due to
GFP_KERNEL direct reclaims in the pnfs subsystem.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: do not use i_wrbuffer_ref as refcount for Fb cap
ceph: fix list_add in ceph_put_snap_realm
ceph: print debug message before put mds session
Prevents an infinite loop as list was never emptied.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Free the slot and resend the RPC with new session <slot#,seq#>.
For nfs4_async_handle_error, return -EAGAIN and set the task->tk_status to 0
to restart the async rpc in the rpc_restart_call_prepare state which resets
the slot.
For nfs4_handle_exception, retrying a call that uses nfs4_call_sync will
reset the slot via nfs41_call_sync_prepare.
For open/close/lock/locku/delegreturn/layoutcommit/unlink/rename/write
cachethis is true, so these operations will not trigger an
NFS4ERR_RETRY_UNCACHED_REP.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We increments i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and ceph client hangs.
This bug can be reproduced occasionally by running blogbench.
Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.
Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The mds session, s, could be freed during ceph_put_mds_session.
Move dout before ceph_put_mds_session.
Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Implementing file descriptors for the network namespace
is simple and straight forward.
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Create files under /proc/<pid>/ns/ to allow controlling the
namespaces of a process.
This addresses three specific problems that can make namespaces hard to
work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the child
of the original creator.
- Namespaces don't have names that userspace can use to talk about
them.
The namespace files under /proc/<pid>/ns/ can be opened and the
file descriptor can be used to talk about a specific namespace, and
to keep the specified namespace alive.
A namespace can be kept alive by either holding the file descriptor
open or bind mounting the file someplace else. aka:
mount --bind /proc/self/ns/net /some/filesystem/path
mount --bind /proc/self/fd/<N> /some/filesystem/path
This allows namespaces to be named with userspace policy.
It requires additional support to make use of these filedescriptors
and that will be comming in the following patches.
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Fix what is clearly a simple copy-and-paste error in commenting the
sysfs_update_group() routine.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: fix race condition in AIL push trigger
xfs: make AIL target updates and compares 32bit safe.
xfs: always push the AIL to the target
xfs: exit AIL push work correctly when AIL is empty
xfs: ensure reclaim cursor is reset correctly at end of AG
Some cases (e.g. ecryptfs) can call ->dentry_revalidate with NULL
nameidata.
https://bugzilla.kernel.org/show_bug.cgi?id=34732
Tyler Hicks pointed out that this bug was introduced by commit
e7c0a16786 "fuse: make fuse_dentry_revalidate() RCU aware"
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The VFS superblock structure now has a UUID field, so we can use that
in preference to the UUID field in the GFS2 superblock now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This replaces nilfs_mdt_mark_buffer_dirty and nilfs_btnode_mark_dirty
macros with mark_buffer_dirty and gets rid of nilfs_mark_buffer_dirty,
an own mark buffer dirty function.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
In the current nilfs, page cache for btree nodes and meta data files
do not set a valid back pointer to the host inode in mapping->host.
This will change it so that every address space in nilfs uses
mapping->host to hold its host inode.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This replaces all references of NILFS_I_NILFS(inode)->ns_bdev with
inode->i_sb->s_bdev and unfolds remaining uses of NILFS_I_NILFS inline
function.
Before 2.6.37, referring to a nilfs object from inodes needed a
conditional judgement, and NILFS_I_NILFS was helpful to simplify it.
But now we can simply do it by going through a super block instance
like inode->i_sb->s_fs_info.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This uses list_first_entry macro instead of list_entry if it's used to
get the first entry.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
When shrinking the filesystem, segments to be truncated must be test
if they are busy or not, and unneeded sufile block should be deleted.
This adds routines for the truncation.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
After resizing the filesystem, the secondary super block must be moved
to a new location. This adds a helper function for this.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds a new ioctl command which limits range of segment to be
allocated. This is intended to gather data whithin a range of the
partition before shrinking the filesystem, or to control new log
location for some purpose.
If a range is specified by the ioctl, segment allocator of nilfs tries
to allocate new segments from the range unless no free segments are
available there.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The super root block is newly-allocated each time it is written back
to disk, so unused portion of the block should be cleared.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The size of super root structure depends on inode size, so
NILFS_SR_BYTES macro should be a function of the inode size. This
fixes the issue.
Even though a different size value will be written for a possible
future filesystem with extended inode, but fortunately this does not
break disk format compatibility.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Previously, nilfs was cloning pages for mmapped region to freeze their
data and ensure consistency of checksum during writeback cycles. A
private page allocator was used for this page cloning. But, we no
longer need to do that since clear_page_dirty_for_io function sets up
pte so that vm_ops->page_mkwrite function is called right before the
mmapped pages are modified and nilfs_page_mkwrite function can safely
wait for the pages to be written back to disk.
So, this stops making a copy of mmapped pages during writeback, and
eliminates the private page allocation and deallocation functions from
nilfs.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Merge list_del() + list_add_tail() to list_move_tail().
Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
After having applied commit 9954e7af14 ("nilfs2: add free
entries count only if clear bit operation succeeded"), a free routine
of nilfs came to fall into an infinite loop, outputting the same
message endlessly:
nilfs_palloc_freev: entry number 29497 already freed
nilfs_palloc_freev: entry number 29497 already freed
nilfs_palloc_freev: entry number 29497 already freed
nilfs_palloc_freev: entry number 29497 already freed
nilfs_palloc_freev: entry number 29497 already freed ...
That patch broke the routine so that a loop counter is never updated
in an abnormal state. This fixes the regression.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This is the final part of the ops_inode.c/inode.c reordering. We
are left with a single file called inode.c which now contains
all the inode operations, as expected.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One is caused by a
race condition in determining whether there is a psh in progress or
not.
The XFS_AIL_PUSHING_BIT is used to determine whether a push is
currently in progress. When the AIL push work completes, it checked
whether the target changed and cleared the PUSHING bit to allow a
new push to be requeued. The race condition is as follows:
Thread 1 push work
smp_wmb()
smp_rmb()
check ailp->xa_target unchanged
update ailp->xa_target
test/set PUSHING bit
does not queue
clear PUSHING bit
does not requeue
Now that the push target is updated, new attempts to push the AIL
will not trigger as the push target will be the same, and hence
despite trying to push the AIL we won't ever wake it again.
The fix is to ensure that the AIL push work clears the PUSHING bit
before it checks if the target is unchanged.
As a result, both push triggers operate on the same test/set bit
criteria, so even if we race in the push work and miss the target
update, the thread requesting the push will still set the PUSHING
bit and queue the push work to occur. For safety sake, the same
queue check is done if the push work detects the target change,
though only one of the two will will queue new work due to the use
of test_and_set_bit() checks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
(cherry picked from commit e4d3c4a43b)
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One of the problems
noticed was that updates of the push target are not 32 bit safe as
the target is a 64 bit value.
We cannot copy a 64 bit LSN without the possibility of corrupting
the result when racing with another updating thread. We have
function to do this update safely without needing to care about
32/64 bit issues - xfs_trans_ail_copy_lsn() - so use that when
updating the AIL push target.
Also move the reading of the target in the push work inside the AIL
lock, and use XFS_LSN_CMP() for the unlocked comparison during work
termination to close read holes as well.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
(cherry picked from commit fd5670f22f)
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One of the problems
discovered is a target mismatch between the item pushing loop and
the target itself.
The push trigger checks for the target increasing (i.e. new target >
current) while the push loop only pushes items that have a LSN <
current. As a result, we can get the situation where the push target
is X, the items at the tail of the AIL have LSN X and they don't get
pushed. The push work then completes thinking it is done, and cannot
be restarted until the push target increases to >= X + 1. If the
push target then never increases (because the tail is not moving),
then we never run the push work again and we stall.
Fix it by making sure log items with a LSN that matches the target
exactly are pushed during the loop.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
(cherry picked from commit cb64026b6e)
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. The main cause is a
regression where a work exit path fails to clear the PUSHING state
and recheck the target correctly.
Make both exit paths do the same PUSHING bit clearing and target
checking when the "no more work to be done" condition is hit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
(cherry picked from commit ea35a20021)
On a 32 bit highmem PowerPC machine, the XFS inode cache was growing
without bound and exhausting low memory causing the OOM killer to be
triggered. After some effort, the problem was reproduced on a 32 bit
x86 highmem machine.
The problem is that the per-ag inode reclaim index cursor was not
getting reset to the start of the AG if the radix tree tag lookup
found no more reclaimable inodes. Hence every further reclaim
attempt started at the same index beyond where any reclaimable
inodes lay, and no further background reclaim ever occurred from the
AG.
Without background inode reclaim the VM driven cache shrinker
simply cannot keep up with cache growth, and OOM is the result.
While the change that exposed the problem was the conversion of the
inode reclaim to use work queues for background reclaim, it was not
the cause of the bug. The bug was introduced when the cursor code
was added, just waiting for some weird configuration to strike....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-By: Christian Kujau <lists@nerdbynature.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
(cherry picked from commit b223221956)
Linux kernel excludes guard page when performing mlock on a VMA with
down-growing stack. However, some architectures have up-growing stack
and locking the guard page should be excluded in this case too.
This patch fixes lvm2 on PA-RISC (and possibly other architectures with
up-growing stack). lvm2 calculates number of used pages when locking and
when unlocking and reports an internal error if the numbers mismatch.
[ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
grows-up case, and to move things around a bit to clean it all up and
share the infrstructure with the /proc bits.
Tested on ia64 that has both grow-up and grow-down segments - Linus ]
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One is caused by a
race condition in determining whether there is a psh in progress or
not.
The XFS_AIL_PUSHING_BIT is used to determine whether a push is
currently in progress. When the AIL push work completes, it checked
whether the target changed and cleared the PUSHING bit to allow a
new push to be requeued. The race condition is as follows:
Thread 1 push work
smp_wmb()
smp_rmb()
check ailp->xa_target unchanged
update ailp->xa_target
test/set PUSHING bit
does not queue
clear PUSHING bit
does not requeue
Now that the push target is updated, new attempts to push the AIL
will not trigger as the push target will be the same, and hence
despite trying to push the AIL we won't ever wake it again.
The fix is to ensure that the AIL push work clears the PUSHING bit
before it checks if the target is unchanged.
As a result, both push triggers operate on the same test/set bit
criteria, so even if we race in the push work and miss the target
update, the thread requesting the push will still set the PUSHING
bit and queue the push work to occur. For safety sake, the same
queue check is done if the push work detects the target change,
though only one of the two will will queue new work due to the use
of test_and_set_bit() checks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One of the problems
noticed was that updates of the push target are not 32 bit safe as
the target is a 64 bit value.
We cannot copy a 64 bit LSN without the possibility of corrupting
the result when racing with another updating thread. We have
function to do this update safely without needing to care about
32/64 bit issues - xfs_trans_ail_copy_lsn() - so use that when
updating the AIL push target.
Also move the reading of the target in the push work inside the AIL
lock, and use XFS_LSN_CMP() for the unlocked comparison during work
termination to close read holes as well.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One of the problems
discovered is a target mismatch between the item pushing loop and
the target itself.
The push trigger checks for the target increasing (i.e. new target >
current) while the push loop only pushes items that have a LSN <
current. As a result, we can get the situation where the push target
is X, the items at the tail of the AIL have LSN X and they don't get
pushed. The push work then completes thinking it is done, and cannot
be restarted until the push target increases to >= X + 1. If the
push target then never increases (because the tail is not moving),
then we never run the push work again and we stall.
Fix it by making sure log items with a LSN that matches the target
exactly are pushed during the loop.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. The main cause is a
regression where a work exit path fails to clear the PUSHING state
and recheck the target correctly.
Make both exit paths do the same PUSHING bit clearing and target
checking when the "no more work to be done" condition is hit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
On a 32 bit highmem PowerPC machine, the XFS inode cache was growing
without bound and exhausting low memory causing the OOM killer to be
triggered. After some effort, the problem was reproduced on a 32 bit
x86 highmem machine.
The problem is that the per-ag inode reclaim index cursor was not
getting reset to the start of the AG if the radix tree tag lookup
found no more reclaimable inodes. Hence every further reclaim
attempt started at the same index beyond where any reclaimable
inodes lay, and no further background reclaim ever occurred from the
AG.
Without background inode reclaim the VM driven cache shrinker
simply cannot keep up with cache growth, and OOM is the result.
While the change that exposed the problem was the conversion of the
inode reclaim to use work queues for background reclaim, it was not
the cause of the bug. The bug was introduced when the cursor code
was added, just waiting for some weird configuration to strike....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-By: Christian Kujau <lists@nerdbynature.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
* hpfs:
HPFS: Remove unused variable
HPFS: Move declaration up, so that there are no out-of-scope pointers
HPFS: Fix some unaligned accesses
HPFS: Fix endianity. Make hpfs work on big-endian machines
HPFS: Implement fsync for hpfs
HPFS: Fix a bug that filesystem was not marked dirty when remounting it
HPFS: Restrict uid and gid to 16-bit values
HPFS: When marking or clearing the dirty bit, sync the filesystem
HPFS: Use types with defined width
HPFS: Remove mark_inode_dirty
HPFS: Remove CR/LF conversion option
HPFS: Remove remaining locks
HPFS: Introduce a global mutex and lock it on every callback from VFS.
HPFS: Make HPFS compile on preempt and SMP
Move declaration up, so that there are no out-of-scope pointers
Reported-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a bug that filesystem was not marked dirty when remounting it
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Restrict uid and gid to 16-bit values.
HPFS stores only 2 bytes in the EAs.
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When marking or clearing the dirty bit, sync the filesystem
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use types with defined width
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>