Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (24 commits)
[CIFS] merge conflict in fs/cifs/export.c
[CIFS] Allow disabling CIFS Unix Extensions as mount option
[CIFS] More whitespace/formatting fixes (noticed by checkpatch)
[CIFS] Typo in previous patch
[CIFS] zero_user_page() conversions
[CIFS] use simple_prepare_write to zero page data
[CIFS] Fix build break - inet.h not included when experimental ifdef off
[CIFS] Add support for new POSIX unlink
[CIFS] whitespace/formatting fixes
[CIFS] Fix oops in cifs_create when nfsd server exports cifs mount
[CIFS] whitespace cleanup
[CIFS] Fix packet signatures for NTLMv2 case
[CIFS] more whitespace fixes
[CIFS] more whitespace cleanup
[CIFS] whitespace cleanup
[CIFS] whitespace cleanup
[CIFS] ipv6 support no longer experimental
[CIFS] Mount should fail if server signing off but client mount option requires it
[CIFS] whitespace fixes
[CIFS] Fix sign mount option and sign proc config setting
...
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/docs-2.6:
zh_CN/HOWTO: update URLs of git trees
Chinese translation of Documentation/stable_api_nonsense.txt
HOWTO: add Chinese translation of Documentation/HOWTO
Documentation: add Japanese translated stable_api_nonsense.txt
HOWTO: add Japanese translation of Documentation/HOWTO
* 'for-linus' of git://linux-nfs.org/~bfields/linux:
locks: fix vfs_test_lock() comment
locks: make posix_test_lock() interface more consistent
nfs: disable leases over NFS
gfs2: stop giving out non-cluster-coherent leases
locks: export setlease to filesystems
locks: provide a file lease method enabling cluster-coherent leases
locks: rename lease functions to reflect locks.c conventions
locks: share more common lease code
locks: clean up lease_alloc()
locks: convert an -EINVAL return to a BUG
leases: minor break_lease() comment clarification
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband: (29 commits)
IB/mthca: Simplify use of size0 in work request posting
IB/mthca: Factor out setting WQE UD segment entries
IB/mthca: Factor out setting WQE remote address and atomic segment entries
IB/mlx4: Factor out setting other WQE segments
IB/mlx4: Factor out setting WQE data segment entries
IB/mthca: Factor out setting WQE data segment entries
IB/mlx4: Return receive queue sizes for userspace QPs from query QP
IB/mlx4: Increase max outstanding RDMA reads as target
RDMA/cma: Remove local write permission from QP access flags
IB/mthca: Use uninitialized_var() for f0
IB/cm: Make internal function cm_get_ack_delay() static
IB/ipath: Remove ipath_get_user_pages_nocopy()
IB/ipath: Make a few functions static
mlx4_core: Reset device when internal error is detected
IB/iser: Make a couple of functions static
IB/mthca: Fix printk format used for firmware version in warning
IB/mthca: Schedule MSI support for removal
IB/ehca: Fix warnings issued by checkpatch.pl
IB/ehca: Restructure ehca_set_pagebuf()
IB/ehca: MR/MW structure refactoring
...
Previously the only way to do this was to umount all mounts to that server,
turn off a proc setting (/proc/fs/cifs/LinuxExtensionsEnabled).
Fixes Samba bugzilla bug number: 4582 (and also 2008)
Signed-off-by: Steve French <sfrench@us.ibm.com>
Thanks to Doug Chapman for pointing out that the comment here is
inconsistent with the function prototype.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Since posix_test_lock(), like fcntl() and ->lock(), indicates absence or
presence of a conflict lock by setting fl_type to, respectively, F_UNLCK
or something other than F_UNLCK, the return value is no longer needed.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
As Peter Staubach says elsewhere
(http://marc.info/?l=linux-kernel&m=118113649526444&w=2):
> The problem is that some file system such as NFSv2 and NFSv3 do
> not have sufficient support to be able to support leases correctly.
> In particular for these two file systems, there is no over the wire
> protocol support.
>
> Currently, these two file systems fail the fcntl(F_SETLEASE) call
> accidentally, due to a reference counting difference. These file
> systems should fail more consciously, with a proper error to
> indicate that the call is invalid for them.
Define an nfs setlease method that just returns -EINVAL.
If someone can demonstrate a real need, perhaps we could reenable
them in the presence of the "nolock" mount option.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Cc: Peter Staubach <staubach@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Since gfs2 can't prevent conflicting opens or leases on other nodes, we
probably shouldn't allow it to give out leases at all.
Put the newly defined lease operation into use in gfs2 by turning off
lease, unless we're using the "nolock' locking module (in which case all
locking is local anyway).
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Currently leases are only kept locally, so there's no way for a distributed
filesystem to enforce them against multiple clients. We're particularly
interested in the case of nfsd exporting a cluster filesystem, in which
case nfsd needs cluster-coherent leases in order to implement delegations
correctly.
Also add some documentation.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We've been using the convention that vfs_foo is the function that calls
a filesystem-specific foo method if it exists, or falls back on a
generic method if it doesn't; thus vfs_foo is what is called when some
other part of the kernel (normally lockd or nfsd) wants to get a lock,
whereas foo is what filesystems call to use the underlying local
functionality as part of their lock implementation.
So rename setlease to vfs_setlease (which will call a
filesystem-specific setlease after a later patch) and __setlease to
setlease.
Also, vfs_setlease need only be GPL-exported as long as it's only needed
by lockd and nfsd.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Share more code between setlease (used by nfsd) and fcntl.
Also some minor cleanup.
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Acked-by: Christoph Hellwig <hch@infradead.org>
Return the newly allocated structure as the return value instead of
using a struct ** parameter.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
There's no point trying to return an error in these cases, which all represent
bugs in the callers.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
clarify that break_lease() checks for presence of any lock, not just leases.
Signed-off-by: David M. Richter <richterd@citi.umich.edu>
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Addressing patch from Stefan Richter:
HOWTO: update URLs of git trees
(It will be better if we update this to commit-id later)
Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This is a Chinese translated version of
Documentation/stable_api_nonsense.txt.
From: TripleX <zhongyu@18mail.cn>
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This is a Chinese translated version of Documentation/HOWTO. Currently
Chinese involvement in Linux kernel is very low, especially comparing to
its largest population base. Language could be the main obstacle. Hope
this document will help more Chinese to contribute to Linux kernel.
Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: TripleX Chung <xxx.phy@gmail.com>
Signed-off-by: Maggie Chen <chenqi@beyondsoft.com>
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add the japanese translation of the Documentation/HOWTO file.
Signed-off-by: Tsugikazu Shibata <tshibata@ab.jp.nec.com>
Cc: IKEDA Munehiro <m-ikeda@ds.jp.nec.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
this is a patch that adds support for Hilscher CIF DeviceNet and
Profibus cards. I tested it on a Kontron CPX board, and Thomas reviewed
it.
You can find the user space part here:
http://www.osadl.org/projects/downloads/UIO/user/cif-0.1.0.tar.gz
Notes: cif_api.c is the main file you want to look at. It contains the
functions to open, close, mmap and so on. cif_dps.c adds functions
specific to Profibus cards, and cif_dn.c contains functions for
DeviceNet cards. cif.c is a universal playground, it's just a small
test program. The user space part of this UIO driver is still work in
progress, and not everything is tested yet. At the moment, the thread in
cif_api.c contains some code that artificially makes the card generate
interrupts, this was added for testing and will be removed later. But
the driver already contains all the functions needed for useful
operation, so it gives a good idea of how such a thing looks like.
For comparison, here's what you get from the manufacturer
(www.hilscher.com) when you ask for a Linux 2.6 driver:
http://www.tglx.de/private/hjk/cif-orig-2.6.tar.bz2
WARNING: Don't look at the code for too long, you might become sick :-)
Signed-off-by: Hans-Jürgen Koch <hjk@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This interface allows the ability to write the majority of a driver in
userspace with only a very small shell of a driver in the kernel itself.
It uses a char device and sysfs to interact with a userspace process to
process interrupts and control memory accesses.
See the docbook documentation for more details on how to use this
interface.
From: Hans J. Koch <hjk@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benedikt Spranger <b.spranger@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Node addition failure is detected by testing return value of
sysfs_addfm_finish() which returns the number of added and removed
nodes. As the function is called as the last step of addition right
on top of error handling block, the if blocks looked like the
following.
if (sysfs_addrm_finish(&acxt))
success handling, usually return;
/* fall through to error handling */
This is the opposite of usual convention in sysfs and makes the code
difficult to understand. This patch inverts the test and makes those
blocks look more like others.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Gabriel C <nix.or.die@googlemail.com>
Cc: Miles Lane <miles.lane@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There is a subtle bug in sysfs_create_link() failure path. When
symlink creation fails because there's already a node with the same
name, the target sysfs_dirent is put twice - once by failure path of
sysfs_create_link() and once more when the symlink is released.
Fix it by making only the symlink node responsible for putting
target_sd.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Gabriel C <nix.or.die@googlemail.com>
Cc: Miles Lane <miles.lane@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Check for return value of sysfs_create_link() in device_add() and
device_rename(). Add helper functions device_add_class_symlinks() and
device_remove_class_symlinks() to make the code easier to read.
[akpm@linux-foundation.org: fix unused var warnings]
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We should let everybody know about where the regression
list is hosted. The more is known the more it is used.
Signed-off-by: Paolo Ciarrocchi <paolo.ciarrocchi@gmail.com>
Cc: Li Yang <leoli@freescale.com>
Cc: TripleX Chung <xxx.phy@gmail.com>
Cc: Maggie Chen <chenqi@beyondsoft.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Tsugikazu Shibata <tshibata@ab.jp.nec.com>
Cc: IKEDA Munehiro <m-ikeda@ds.jp.nec.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Update CodingStyle to talk about "-DDEBUG" message conventions and the
new "-DVERBOSE_DEBUG" convention.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This defines a dev_vdbg() call, which is enabled with -DVERBOSE_DEBUG.
When enabled, dev_vdbg() acts just like dev_dbg(). When disabled, it is a
NOP ... just like dev_dbg() without -DDEBUG. The specific code was moved
out of a USB patch, but lots of drivers have similar support.
That is, code can now be written to use an additional level of debug
output, selected at compile time. Many driver authors have found this
idiom to be very useful. A typical usage model is for "normal" debug
messages to focus on fault paths and not be very "chatty", so that those
messages can be left on during normal operation without much of a
performance or syslog load. On the other hand "verbose" messages would be
noisy enough that they wouldn't normally be enabled; they might even affect
timings enough to change system or driver behavior.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
With sysfs_fill_super() converted to use sysfs_get_inode(), there is
no user of sysfs_init_inode() outside of fs/sysfs/inode.c. Make it
static.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
While making sysfs indoes hashed, sysfs root inode was left out. Now
that nlink accounting depends on the inode being on the hash, sysfs
root inode nlink isn't adjusted properly.
Put sysfs root inode on the inode hash by allocating it using
sysfs_get_inode() like other sysfs inodes. While at it, massage
comments a bit.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
kmem_cache_free() with NULL is not allowed. But it may happen
if out of memory error is triggered in sysfs_new_dirent().
This patch fixes that error handling.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch (as933) removes the deprecated dpm_runtime_suspend() and
dpm_runtime_resume() routines from the PM core. The only user of
those routines is the PCMCIA ds driver; local replacements are added.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This allows the uevent file to handle any type of uevent action to be
triggered by userspace instead of just the "add" uevent.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Hi,
This patch kills the pointless debugfs rmdir() printk() when called on a
non-empty directory. blktrace will sometimes have to call it a few times
when forcefully ending a trace, which polutes the log with pointless
warnings.
Rationale:
- It's more code to work-around this "problem" in the debugfs users, and
you would have to add code to check for empty directories to do so (or
assume that debugfs is using simple_ helpers, but that would be a
layering violation).
- Other rmdir() implementations don't complain about something this
silly.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The driver didn't allow an interface's MAC address to be modified if the
respective interface wasn't setup - a failing Hcall was the result. Thus
bonding wasn't usable. The fix moves the failing Hcall which was registering
a MAC address for the reception of BC packets in firmware from the port up
and down functions to the port resources setup functions. Additionally the
missing update of the last_rx member of the netdev structure was added.
Signed-off-by: Thomas Klein <tklein@de.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
This patch implements the driver necessary use the Analog Devices
Blackfin processor's on-chip ethernet MAC controller.
[try#2]
- add timeout control
- kill dma_config_reg bitfields
- some trivial cleanup
[try#3]
- add endianess check
- add DRV_NAME, DRV_VERSION... driver information string
- add some comments for silicon anomaly and dma API confusion
- some code trivial cleanup
[try#4]
- add Blackfin latest GPIO pin mux opertion with Michael Hennerich's
help and Dan's review
- rewrite the DMA descriptor list operation in a more readable way
by Joe's review
[try#5]
- cleanup some coding style by Joe's review.
[try#6]
- 1.1 version fix a bug when set up multicast list pointed by Mr. yoshfuji
- rearrange the desc_list_free function.
Signed-off-by: Michael Hennerich <michael.hennerich@analog.com>
Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Cc: Michael Buesch <mb@bu3sch.de>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dcbw@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>