* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
[POWERPC] ehea: Remove dependency on MEMORY_HOTPLUG
[POWERPC] Make walk_memory_resource available with MEMORY_HOTPLUG=n
[POWERPC] Use dev_set_name in pci_64.c
[POWERPC] Fix incorrect enabling of VMX when building signal or user context
[POWERPC] boot/Makefile CONFIG_ variable fixes
Minor source code cleanup of page flags in mm/page_alloc.c.
Move the definition of the groups of bits to page-flags.h.
The purpose of this clean up is that the next patch will
conditionally add a page flag to the groups. Doing that
in a header file is cleaner than adding #ifdefs to the
C code.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ehea driver was recently changed[1] to use walk_memory_resource() to
detect the system's memory layout. However, walk_memory_resource() is
available only when memory hotplug is enabled. So CONFIG_EHEA was
made to depend on MEMORY_HOTPLUG [2], but it is inappropriate for a
network driver to have such a dependency.
Make the declaration of walk_memory_resource() and its powerpc
implementation (ehea is powerpc-specific) unconditionally available.
[1] 48cfb14f8b
"ehea: Add DLPAR memory remove support"
[2] fb7b6ca2b6
"ehea: Add dependency to Kconfig"
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Commit 73f20e58b1 ("FAT_VALID_MEDIA():
remove pointless test") wrongly added the new fat_valid_media() function
to the userspace-visible part of include/linux/msdos_fs.h
Move it to the part of include/linux/msdos_fs.h that is not exported to
userspace.
Reported-by: Onur Küçük <onur@pardus.org.tr>
Reported-by: S.Çağlar Onur <caglar@pardus.org.tr>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: enable barriers by default
jbd2: Fix barrier fallback code to re-lock the buffer head
ext4: Display the journal_async_commit mount option in /proc/mounts
jbd2: If a journal checksum error is detected, propagate the error to ext4
jbd2: Fix memory leak when verifying checksums in the journal
ext4: fix online resize bug
ext4: Fix uninit block group initialization with FLEX_BG
ext4: Fix use of uninitialized data with debug enabled.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6:
capabilities: remain source compatible with 32-bit raw legacy capability support.
LSM: remove stale web site from MAINTAINERS
To get zeroed out memory from a particular NUMA node. To be used by
sunrpc.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When posting:
[PATCH 1/8] Scaling msgmni to the amount of lowmem
(see http://article.gmane.org/gmane.linux.kernel/637849/) I changed the
MSGPOOL value to make it fit what is said in the man pages (i.e. a size
in bytes).
But Michael Kerrisk rightly complained that this change could affect the
ABI. So I'm posting this patch to make MSGPOOL expressed back in Kbytes.
Michael, on his side, has fixed the man page.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Pierre Peiffer <peifferp@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces memory_read_from_buffer().
The only difference between memory_read_from_buffer() and
simple_read_from_buffer() is which address space the function copies to.
simple_read_from_buffer copies to user space memory.
memory_read_from_buffer copies to normal memory.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Doug Warzecha <Douglas_Warzecha@dell.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Cc: Abhay Salunke <Abhay_Salunke@dell.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Markus Rechberger <markus.rechberger@amd.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Len Brown <lenb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Cc: Michael Holzheu <holzheu@de.ibm.com>
Cc: Brian King <brking@us.ibm.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Andrew Vasquez <linux-driver@qlogic.com>
Cc: Seokmann Ju <seokmann.ju@qlogic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bluetooth will be able to use this.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Dave Young <hidave.darkstar@gmail.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although if people have questions about ARCnet, perhaps it's _better_
for them to be mailing dwmw2@cam.ac.uk about it...
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These were removed in commit 26d507fcfe:
> -#define V4L2_CID_HCENTER (V4L2_CID_BASE+22)
> -#define V4L2_CID_VCENTER (V4L2_CID_BASE+23)
> -#define V4L2_CID_LASTP1 (V4L2_CID_BASE+24) /*
> last CID + 1 */
> +
> +/* Deprecated, use V4L2_CID_PAN_RESET and V4L2_CID_TILT_RESET */
> +#define V4L2_CID_HCENTER_DEPRECATED (V4L2_CID_BASE+22)
> +#define V4L2_CID_VCENTER_DEPRECATED (V4L2_CID_BASE+23)
But there was no warning in Documentation/feature-removal-schedule.txt
and I'm receiving reports that it's breaking userspace apps (the
gstreamer-v4l2 plugin breaks in Fedora rawhide). You can't just pull
things from the published userspace API like that.
Please can we revert the addition of _DEPRECATED to these ioctl
definitions. Perhaps we can add a runtime warning if they actually get
used? Or a compile-time warning if we can manage that?
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
l2tp: Fix possible oops if transmitting or receiving when tunnel goes down
tcp: Fix for race due to temporary drop of the socket lock in skb_splice_bits.
tcp: Increment OUTRSTS in tcp_send_active_reset()
raw: Raw socket leak.
lt2p: Fix possible WARN_ON from socket code when UDP socket is closed
USB ID for Philips CPWUA054/00 Wireless USB Adapter 11g
ssb: Fix context assertion in ssb_pcicore_dev_irqvecs_enable
libertas: fix command size for CMD_802_11_SUBSCRIBE_EVENT
ipw2200: expire and use oldest BSS on adhoc create
airo warning fix
b43legacy: Fix controller restart crash
sctp: Fix ECN markings for IPv6
sctp: Flush the queue only once during fast retransmit.
sctp: Start T3-RTX timer when fast retransmitting lowest TSN
sctp: Correctly implement Fast Recovery cwnd manipulations.
sctp: Move sctp_v4_dst_saddr out of loop
sctp: retran_path update bug fix
tcp: fix skb vs fack_count out-of-sync condition
sunhme: Cleanup use of deprecated calls to save_and_cli and restore_flags.
xfrm: xfrm_algo: correct usage of RIPEMD-160
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
libata-sff: Fix oops reported in kerneloops.org for pnp devices with no ctl
libata: kill unused constants
sata_mv: PHY_MODE4 cleanups
[libata] ata_piix: more acer short cable quirks
[libata] ACPI: Properly handle bay devices in dock stations
- Make ata_sff_altstatus private so nobody uses it by mistake
- Drop the 400nS delay from it
Add
ata_sff_irq_status - encapsulates the IRQ check logic
This function keeps the existing behaviour for altstatus using devices. I
actually suspect the logic was wrong before the changes but -rc isn't the
time to play with that
ata_sff_sync - ensure writes hit the device
Really we want an io* operation for 'is posted' eg ioisposted(ioaddr) so
that we can fix the nasty delay this causes on most systems.
- ata_sff_pause - 400nS delay
Ensure the command hit the device and delay 400nS
- ata_sff_dma_pause
Ensure the I/O hit the device and enforce an HDMA1:0 transition delay.
Requires altstatus register exists, BUG if not so we don't risk
corruption in MWDMA modes. (UDMA the checksum will save your backside in
theory)
The only other complication then is devices with their own handlers.
rb532 can use dma_pause but scc needs to access its own altstatus
register for internal errata workarounds so directly call the drivers own
altstatus function.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
The field was supposed to allow the creation of an anycast route by
assigning an anycast address to an address prefix. It was never
implemented so this field is unused and serves no purpose. Remove it.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also removes an unused policy entry for an attribute which is
only used in kernel->user direction.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also removes an obsolete check for the unused flag RTCF_MASQ.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tty layer provides a callback that is used when the line discipline
is changed. Some hardware uses this to configure hardware specific
features such as IrDA mode on serial ports. Unfortunately the serial
layer does not provide this feature or pass it down to drivers.
Blackfin used to hack around this by rewriting the tty ops, but those are
now properly shared and const so the hack fails. Instead provide the
proper operations.
This change plus a follow up from the Blackfin guys is needed to avoid
blackfin losing features in this release.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since mmc_spi.h uses irqreturn_t type, it should include appropriate
header, otherwise build will break if users didn't include it (some of
them do not use interrupts).
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Source code out there hard-codes a notion of what the
_LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the
raw capability system calls capget() and capset(). Its unfortunate, but
true.
Since the confusing header file has been in a released kernel, there is
software that is erroneously using 64-bit capabilities with the semantics
of 32-bit compatibilities. These recently compiled programs may suffer
corruption of their memory when sys_getcap() overwrites more memory than
they are coded to expect, and the raising of added capabilities when using
sys_capset().
As such, this patch does a number of things to clean up the situation
for all. It
1. forces the _LINUX_CAPABILITY_VERSION define to always retain its
legacy value.
2. adopts a new #define strategy for the kernel's internal
implementation of the preferred magic.
3. deprecates v2 capability magic in favor of a new (v3) magic
number. The functionality of v3 is entirely equivalent to v2,
the only difference being that the v2 magic causes the kernel
to log a "deprecated" warning so the admin can find applications
that may be using v2 inappropriately.
[User space code continues to be encouraged to use the libcap API which
protects the application from details like this. libcap-2.10 is the first
to support v3 capabilities.]
Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518.
Thanks to Bojan Smojver for the report.
[akpm@linux-foundation.org: s/depreciate/deprecate/g]
[akpm@linux-foundation.org: be robust about put_user size]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Bojan Smojver <bojan@rexursive.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
lguest: notify on empty
virtio: force callback on empty.
virtio_blk: fix endianess annotations
virtio_config: fix len calculation of config elements
virtio_net: another race with virtio_net and enable_cb
virtio: An entropy device, as suggested by hpa.
virtio_blk: allow read-only disks
lguest: fix ugly <NULL> in /proc/interrupts
virtio: set device index in common code.
virtio: virtio_pci should not set bus_id.
virtio: bus_id for devices should contain 'virtio'
Fix crash in virtio_blk during modprobe ; rmmod ; modprobe
lguest: use ioremap_cache, not ioremap
The SW_RADIO code for EV_SW events has a name that is not descriptive
enough of its intended function, and could induce someone to think
KEY_RADIO is its EV_KEY counterpart, which is false.
Rename it to SW_RFKILL_ALL, and document what this event is for. Keep
the old name around, to avoid userspace ABI breaks.
The SW_RFKILL_ALL event is meant to be used by rfkill master switches. It
is not bound to a particular radio switch type, and usually applies to all
types. It is semantically tied to master rfkill switches that enable or
disable every radio in a system.
Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
virtio allows drivers to suppress callbacks (ie. interrupts) for
efficiency (no locking, it's just an optimization).
There's a similar mechanism for the host to suppress notifications
coming from the guest: in that case, we ignore the suppression if the
ring is completely full.
It turns out that life is simpler if the host similarly ignores
callback suppression when the ring is completely empty: the network
driver wants to free up old packets in a timely manner, and otherwise
has to use a timer to poll.
We have to remove the code which ignores interrupts when the driver
has disabled them (again, it had no locking and hence was unreliable
anyway).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since commit 72e61eb40b (virtio: change config
to guest endian) config space is no longer fixed endian.
Lets change the virtio_blk_config variables.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty,
This patch is a prereq for the virtio_blk blocksize patch, please apply it
first.
Adding an u32 value to the virtio_blk_config unconvered a small bug the config
space defintions:
v is a pointer, to we have to use sizeof(*v) instead of sizeof(v).
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Note that by itself, having a "hardware" random generator does very
little: you should probably run "rngd" in your guest to feed this into
the kernel entropy pool.
Included:
virtio_rng: dont use vmalloced addresses for virtio
If virtio_rng is build as a module, random_data is an address
in vmalloc space. As virtio expects guest real addresses, this
can cause any kind of funny behaviour, so lets allocate
random_data dynamically with kmalloc.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Hello Rusty,
sometimes it is useful to share a disk (e.g. usr). To avoid file system
corruption, the disk should be mounted read-only in that case. This patch
adds a new feature flag, that allows the host to specify, if the disk should
be considered read-only.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Create the dev_set_name function now so that various subsystems can
start changing over to it before other changes in 2.6.27 will make it
compulsory.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
improve the sysbench ramp-up phase and its peak throughput on
a 16way NUMA box, by turning on WAKE_AFFINE:
tip/sched tip/sched+wake-affine
-------------------------------------------------
1: 700 830 +15.65%
2: 1465 1391 -5.28%
4: 3017 3105 +2.81%
8: 5100 6021 +15.30%
16: 10725 10745 +0.19%
32: 10135 10150 +0.16%
64: 9338 9240 -1.06%
128: 8599 8252 -4.21%
256: 8475 8144 -4.07%
-------------------------------------------------
SUM: 57558 57882 +0.56%
this change also improves lat_ctx from 6.69 usecs to 1.11 usec:
$ ./lat_ctx -s 0 2
"size=0k ovr=1.19
2 1.11
$ ./lat_ctx -s 0 2
"size=0k ovr=1.22
2 6.69
in sysbench it's an overall win with some weakness at the lots-of-clients
side. That happens because we now under-balance this workload
a bit. To counter that effect, turn on NEWIDLE:
wake-idle wake-idle+newidle
-------------------------------------------------
1: 830 834 +0.43%
2: 1391 1401 +0.65%
4: 3105 3091 -0.43%
8: 6021 6046 +0.42%
16: 10745 10736 -0.08%
32: 10150 10206 +0.55%
64: 9240 9533 +3.08%
128: 8252 8355 +1.24%
256: 8144 8384 +2.87%
-------------------------------------------------
SUM: 57882 58591 +1.21%
as a bonus this not only improves the many-clients case but
also improves the (more important) rampup phase.
sysbench is a workload that quickly breaks down if the
scheduler over-balances, so since it showed an improvement
under NEWIDLE this change is definitely good.
Yanmin Zhang reported:
Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1.
It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito.
With bisect, I located the following patch:
| 18d95a2832 is first bad commit
| commit 18d95a2832
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date: Sat Apr 19 19:45:00 2008 +0200
|
| sched: fair-group: SMP-nice for group scheduling
Revert it so that we get v2.6.25 behavior.
Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently it uses a single static char array, but that risks
being corrupted when multiple users issue message notes at the
same time. Make the buffers dynamically allocated when the trace
is setup and make them per-cpu instead.
The default max message size of 1k is also very large, the
interface is mainly for small text notes. So shrink it to 128 bytes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Allows messages to be inserted into blktrace streams.
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Based on Roland's patch. This approach was suggested by Austin Clements
from the very beginning, and then by Linus.
As Austin pointed out, the execing task can be killed by SI_TIMER signal
because exec flushes the signal handlers, but doesn't discard the pending
signals generated by posix timers. Perhaps not a bug, but people find this
surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (52 commits)
vlan: Use bitmask of feature flags instead of seperate feature bits
fmvj18x_cs: add NextCom NC5310 rev B support
xirc2ps_cs: re-initialize the multicast address in do_reset
3C509: rx_bytes should not be increased when alloc_skb failed
NETFRONT: Use __skb_queue_purge()
VIRTIO: Use __skb_queue_purge()
phylib: do EXPORT_SYMBOL on get_phy_id
netlink: Fix nla_parse_nested_compat() to call nla_parse() directly
WAN: protect HDLC proto list while insmod/rmmod
drivers/net/fs_enet: remove null pointer dereference
S2io: Version update for napi and MSI-X patches
S2io: Added napi support when MSIX is enabled.
S2io: Move all the transmit completions to a single msi-x (alarm) vector
drivers/net/ehea - remove unnecessary memset after kzalloc
au1000_eth: remove useless check
Blackfin EMAC Driver: Removed duplicated include <linux/ethtool.h>
cpmac bugfixes and enhancements
e1000e: use resource_size_t, not unsigned long, for phys addrs
net/usb: add support for Apple USB Ethernet Adapter
uli526x: add support for netpoll
...
If a journal checksum error is detected, the ext4 filesystem will call
ext4_error(), and the mount will either continue, become a read-only
mount, or cause a kernel panic based on the superblock flags
indicating the user's preference of what to do in case of filesystem
corruption being detected.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Align i2c_device_id.driver_data to 8 bytes to not fail on crossbuilds.
(Added in d2653e92732bd3911feff6bee5e23dbf959381db.)
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
for_each_pgdat() was renamed to for_each_online_pgdat() and kerneldoc
comments should be updated accordingly.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes various gpio-related build errors (mostly potential)
reported in part by Russell King and Uwe Kleine-König.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Arnaud Patard <arnaud.patard@rtp-net.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To keep backwards compatibility, reverse the meanings of these flags so
that when they are not set, the driver uses the original behvaiour.
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Cc: Arnaud Patard <arnaud.patard@rtp-net.org>
Acked-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.
For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.
We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not be be checkpointed.
- We remove spares and then re-added them which loses important state
information.
The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.
Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.
Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.
Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.
Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some configurations, a raid6 resync can be limited by CPU speed
(Calculating P and Q and moving data) rather than by device speed. In
these cases there is nothing to be gained byt serialising resync of arrays
that share a device, and doing the resync in parallel can provide benefit.
So add a sysfs tunable to flag an array as being allowed to resync in
parallel with other arrays that use (a different part of) the same device.
Signed-off-by: Bernd Schubert <bs@q-leap.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kill the trivial and rather pointless file_path wrapper around d_path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>