* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: (59 commits)
[XFS] eagerly remove vmap mappings to avoid upsetting Xen
[XFS] simplify validata_fields
[XFS] no longer using io_vnode, as was remaining from 23 cherrypick
[XFS] Remove STATIC which was missing from prior manual merge
[XFS] Put back the QUEUE_ORDERED_NONE test in the barrier check.
[XFS] Turn off XBF_ASYNC flag before re-reading superblock.
[XFS] avoid race in sync_inodes() that can fail to write out all dirty data
[XFS] This fix prevents bulkstat from spinning in an infinite loop.
[XFS] simplify xfs_create/mknod/symlink prototype
[XFS] avoid xfs_getattr in XFS_IOC_FSGETXATTR ioctl
[XFS] get_bulkall() could return incorrect inode state
[XFS] Kill unused IOMAP_EOF flag
[XFS] fix when DMAPI mount option processing happens
[XFS] ensure file size is logged on synchronous writes
[XFS] growlock should be a mutex
[XFS] replace some large xfs_log_priv.h macros by proper functions
[XFS] kill struct bhv_vfs
[XFS] move syncing related members from struct bhv_vfs to struct xfs_mount
[XFS] kill the vfs_flags member in struct bhv_vfs
[XFS] kill the vfs_fsid and vfs_altfsid members in struct bhv_vfs
...
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect. One of the purposes
now uses the new I_SYNC bit.
Also document the various bits and change their order from historical to
logical.
[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug:
> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.
It can be produced by the following test scheme:
# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done
# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023
% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x
% dmesg|grep 847824 # generated by a debug printk
[ 529.263184] redirtied inode 847824 line 548
[ 564.250872] redirtied inode 847824 line 548
[ 594.272797] redirtied inode 847824 line 548
[ 629.231330] redirtied inode 847824 line 548
[ 659.224674] redirtied inode 847824 line 548
[ 689.219890] redirtied inode 847824 line 548
[ 724.226655] redirtied inode 847824 line 548
[ 759.198568] redirtied inode 847824 line 548
# line 548 in fs/fs-writeback.c:
543 if (wbc->pages_skipped != pages_skipped) {
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
548 redirty_tail(inode);
549 }
More debug efforts show that __block_write_full_page()
never has the chance to call submit_bh() for that big dirty file:
the buffer head is *clean*. So basicly no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.
Also the comment in generic_sync_sb_inodes():
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
and the comment in __block_write_full_page():
1713 /*
1714 * The page was marked dirty, but the buffers were
1715 * clean. Someone wrote them back by hand with
1716 * ll_rw_block/submit_bh. A rare case.
1717 */
do not quite agree with each other. The page writeback should be skipped for
'locked buffer', but here it is 'clean buffer'!
This patch fixes this bug. Though I'm not sure why __block_write_full_page()
is called only to do nothing and who actually issued the writeback for us.
This is the two possible new behaviors after the patch:
1) pretty nice: wait 30s and write ALL:)
2) not so good:
- during the dd: ~16M
- after 30s: ~4M
- after 5s: ~4M
- after 5s: ~176M
The next patch will fix case (2).
Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.
Convert
ctor(void *object, struct kmem_cache *s, unsigned long flags)
to
ctor(struct kmem_cache *s, void *object)
throughout the kernel
[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
XFS leaves stray mappings around when it vmaps memory to make it virtually
contigious. This upsets Xen if one of those pages is being recycled into a
pagetable, since it finds an extra writable mapping of the page.
This patch solves the problem in a brute force way, by making XFS always
eagerly unmap its mappings.
SGI-PV: 971902
SGI-Modid: xfs-linux-melb:xfs-kern:29886a
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Stop using xfs_getattr and a onstack bhv_vattr_t just to get three fields
from the underlying inode and opencode copying from the inode fields
instead.
SGI-PV: 970662
SGI-Modid: xfs-linux-melb:xfs-kern:29711a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Because we cherrypicked SGI-Modid xfs-linux-melb:xfs-kern:29675a
and it depended on the sgi mod which removed io_vnode (which was
not cherrypicked in 23) it was hand modified.
This fixes things back up (to the originial mod) now we have moved
on again.
Reviewed-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Put back the QUEUE_ORDERED_NONE test which caused us grief in sles when it
was taken out as, IIRC, it allowed md/lvm to be thought of as supporting
barriers when they weren't in some configurations. This patch will be
reverting what went in as part of a change for the SGI-pv 964544
(SGI-Modid: xfs-linux-melb:xfs-kern:28568a).
SGI-PV: 971783
SGI-Modid: xfs-linux-melb:xfs-kern:29882a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
In xfs_fs_sync_super() treat a sync the same as a filesystem freeze. This
is needed to force the log to disk for inodes which are not marked dirty
in the Linux inode (the inodes are marked dirty on completion of the log
I/O) and so sync_inodes() will not flush them.
In xfs_fs_write_inode() a synchronous flush will not get an EAGAIN from
xfs_inode_flush() and if an asynchronous flush returns EAGAIN we should
pass it on to the caller. If we get an error while flushing the inode then
re-dirty it so we can try again later.
SGI-PV: 971670
SGI-Modid: xfs-linux-melb:xfs-kern:29860a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Simplify the prototype for xfs_create/xfs_mkdir/xfs_symlink by not passing
down a bhv_vattr_t that just hogs stack space. Instead pass down the mode
in a mode_t and in case of xfs_create the rdev as a scalar type as well.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29794a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
No need to call into xfs_getattr and put a big bhv_vattr_t on the stack
just to get a little information from the XFS inode.
Add a helper called xfs_ioc_fsgetxattr instead that deals with retrieving
the information in a clean way.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29780a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Synchronous writes currently log inode changes before syncing pages to
disk. Since the file size is updated on I/O completion we wont be writing
out the updated file size and if we crash the file will have the wrong
size. This change moves the logging after the syncing of the pages to
ensure we log the correct file size.
SGI-PV: 970334
SGI-Modid: xfs-linux-melb:xfs-kern:29649a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Now that struct bhv_vfs doesn't have any members left we can kill it and
go directly from the super_block to the xfs_mount everywhere.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29509a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All flags are added to xfs_mount's m_flag instead. Note that the 32bit
inode flag was duplicated in both of them, but only cleared in the mount
when it was not nessecary due to the filesystem beeing small enough. Two
flags are still required here - one to indicate the mount option setting,
and one to indicate if it applies or not.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29507a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
vfs_altfsid was just a pointer to mp->m_fixedfsid so we can trivially
replace it with the latter. vfs_fsid also was identical to m_fixedfsid
through rather obfuscated ways so we can kill it as well and simply its
only user.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29506a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Also remove the now dead behavior code.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29505a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All vfs ops now take struct xfs_mount pointers and the behaviour related
glue is split out into methods of its own.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29504a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Mount options are now parsed by the main XFS module and rejected if quota
support is not available, and there are some new quota operation for the
quotactl syscall and calls to quote in the mount, unmount and sync
callchains.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29503a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Mount options are now parsed by the main XFS module and rejected if dmapi
support is not available, and there is a new dm operation to send the
mount event.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29502a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
In the next patch we need to look at the mount structure until just before
it's freed, so we need to be able to free it as the very last thing in
xfs_unmount.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29501a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Now that struct bhv_vnode is empty we can just kill it. Retain bhv_vnode_t
as a typedef for struct inode for the time being until all the fallout is
cleaned up.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29500a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
It's entirely unused except for ignored arguments in the mrlock
initialization, so remove it.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29499a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
struct bhv_vnode is on it's way out, so move the trace buffer to the XFS
inode. Note that this makes the tracing macros rather misnamed, but this
kind of fallout will be fixed up incrementally later on.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29498a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
struct bhv_vnode is on it's way out, so move the I/O count to the XFS
inode.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29497a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All flags previously handled at the vnode level are not in the xfs_inode
where we already have a flags mechanisms and free bits for flags
previously in the vnode.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29495a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
We can easily get at the vfsp through the super_block but it will soon be
gone anyway.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29494a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
All vnode ops now take struct xfs_inode pointers and the behaviour related
glue is split out into methods of it's own. This required fixing
xfs_create/mkdir/symlink to not mess with the inode pointer but rather use
a separate boolean for error handling. Thanks to Dave Chinner for that
fix.
SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29492a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
One of the perpetual scaling problems XFS has is indexing it's incore
inodes. We currently uses hashes and the default hash sizes chosen can
only ever be a tradeoff between memory consumption and the maximum
realistic size of the cache.
As a result, anyone who has millions of inodes cached on a filesystem
needs to tunes the size of the cache via the ihashsize mount option to
allow decent scalability with inode cache operations.
A further problem is the separate inode cluster hash, whose size is based
on the ihashsize but is smaller, and so under certain conditions (sparse
cluster cache population) this can become a limitation long before the
inode hash is causing issues.
The following patchset removes the inode hash and cluster hash and
replaces them with radix trees to avoid the scalability limitations of the
hashes. It also reduces the size of the inodes by 3 pointers....
SGI-PV: 969561
SGI-Modid: xfs-linux-melb:xfs-kern:29481a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Kill uio related functions and defines now that they're unused.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29480a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Simplify the readlink code to get rid of the last user of uio.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29479a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Currently xfs has a rather complicated internal scheme to allow for
different directory formats in IRIX. This patch rips all code related to
this out and pushes useage of the Linux filldir callback into the lowlevel
directory code. This does not make the code any less portable because
filldir can be used to create dirents of all possible variations
(including the IRIX ones as proved by the IRIX binary emulation code under
arch/mips/).
This patch get rid of an unessecary copy in the readdir path, about 400
lines of code and one of the last two users of the uio structure.
This version is updated to deal with dmapi aswell which greatly simplifies
the get_dirattrs code. The dmapi part has been tested using the
get_dirattrs tools from the xfstest dmapi suite1 with various small and
large directories.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29478a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Generally we try not to directly include linux header files in core xfs
code; xfs_linux.h is the spot for that.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29326a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Now that nobody's using it, remove xfs_physmem & friends.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29325a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
m_nreadaheads in the mount struct is never used; remove it and the various
macros assigned to it. Also remove a couple other unused macros in the
same areas.
Removes one user of xfs_physmem.
SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29322a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
If the underlying block device suddenly stops supporting barriers, we need
to handle the -EOPNOTSUPP error in a sane manner rather than shutting
down the filesystem. If we get this error, clear the barrier flag, reissue
the I/O, and tell the world bad things are occurring.
SGI-PV: 964544
SGI-Modid: xfs-linux-melb:xfs-kern:28568a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
- in xfs_probe_cluster rename the inner len to pg_len. There's no harm
here because the outer len isn't used after the inner len comes into
existence but it keeps the code clean.
- in xfs_da_do_buf remove the inner i because they don't overlap
and they are both the same type.
SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29311a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
We can't return a masked result of a __bitwise type. Compare it to 0 first
to keep the behaviour without the warning.
SGI-PV: 968555
SGI-Modid: xfs-linux-melb:xfs-kern:29309a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.
This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
[XFS] Fix inode size update before data write in xfs_setattr
[XFS] Allow punching holes to free space when at ENOSPC
[XFS] Implement ->page_mkwrite in XFS.
[FS] Implement block_page_mkwrite.
Manually fix up conflict with Nick's VM fault handling patches in
fs/xfs/linux-2.6/xfs_file.c
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change ->fault prototype. We now return an int, which contains
VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
FAULT_RET_ code tells the VM whether a page was found, whether it has been
locked, and potentially other things. This is not quite the way he wanted
it yet, but that's changed in the next patch (which requires changes to
arch code).
This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
that a page is locked which requires filemap_nopage to go away (because we
can no longer remain backward compatible without that flag), but we were
going to do that anyway.
struct fault_data is renamed to struct vm_fault as Linus asked. address
is now a void __user * that we should firmly encourage drivers not to use
without really good reason.
The page is now returned via a page pointer in the vm_fault struct.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
the virtual address -> file offset differently from linear mappings.
->populate is a layering violation because the filesystem/pagecache code
should need to know anything about the virtual memory mapping. The hitch here
is that the ->nopage handler didn't pass down enough information (ie. pgoff).
But it is more logical to pass pgoff rather than have the ->nopage function
calculate it itself anyway (because that's a similar layering violation).
Having the populate handler install the pte itself is likewise a nasty thing
to be doing.
This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn. Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.
The rationale for doing this in the first place is that nonlinear mappings are
subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
to duplicate the synchronisation logic rather than just consolidate the two.
After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache. Seems like a fringe functionality anyway.
NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
users have hit mainline yet.
[akpm@linux-foundation.org: cleanup]
[randy.dunlap@oracle.com: doc. fixes for readahead]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>