Commit Graph

180647 Commits

Author SHA1 Message Date
Martyn Welch
32a6275f30 powerpc/86xx: Enable VME driver on the GE SBC610
Enable the VME driver (which is currently in staging) on the SBC610.

Signed-off-by: Martyn Welch <martyn.welch@gefanuc.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-02-17 21:48:23 -06:00
Martyn Welch
f987d82b80 powerpc/86xx: Enable VME driver on the GE PPC9A
Enable the VME driver (which is currently in staging) on the PPC9A

Signed-off-by: Martyn Welch <martyn.welch@gefanuc.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-02-17 21:48:22 -06:00
Malcolm Crossley
41cbdeef37 powerpc/86xx: Add MSI section to GE PPC9A DTS
Add the MSI section to the DTS file for the GE PPC9A.

Signed-off-by: Malcolm Crossley <malcolm.crossley2@gefanuc.com>
Signed-off-by: Martyn Welch <martyn.welch@gefanuc.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-02-17 21:48:21 -06:00
Martyn Welch
26216e3e15 powerpc/86xx: Switch on highmem support on GE SBC610
Signed-off-by: Martyn Welch <martyn.welch@gefanuc.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-02-17 21:48:20 -06:00
Martyn Welch
ae1f7553b9 powerpc/86xx: Basic flash support for GE SBC610
Support for the SBC610 VPX Single Board Computer from GE (PowerPC MPC8641D).

This patch adds basic support for the on-board flash.

Signed-off-by: Martyn Welch <martyn.welch@gefanuc.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-02-17 21:48:05 -06:00
Malcolm Crossley
6459ba984a powerpc/86xx: Add MSI section to GE SBC610 DTS
Add the MSI section to the DTS file for the GE SBC610.

Signed-off-by: Malcolm Crossley <malcolm.crossley2@gefanuc.com>
Signed-off-by: Martyn Welch <martyn.welch@gefanuc.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-02-17 21:17:07 -06:00
Malcolm Crossley
9b952a3970 powerpc/86xx: Fix GE SBC310 XMC site support
Correction to interrupt map mask for GE SBC310 XMC site and addition of
alias.

Signed-off-by: Malcolm Crossley <malcolm.crossley2@gefanuc.com>
Signed-off-by: Martyn Welch <martyn.welch@gefanuc.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-02-17 21:17:05 -06:00
Martyn Welch
f5d570d32c powerpc/86xx: Add MSI section to GE SBC310 DTS
Add the MSI section to the DTS file for the GE SBC310.

Signed-off-by: Martyn Welch <martyn.welch@gefanuc.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-02-17 21:17:03 -06:00
Sebastian Andrzej Siewior
51adc548cb powerpc/fsl-booke: replace a hardcoded constant
24 is offset between the opcode past bl and past rfi. This makes it more
obvious.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-02-17 21:10:25 -06:00
Benjamin Herrenschmidt
efd0f0f385 Merge commit 'jwb/next' into next 2010-02-18 09:34:38 +11:00
Dave Kleikamp
3bffb6529c powerpc/booke: Add support for advanced debug registers
powerpc/booke: Add support for advanced debug registers

From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>

Based on patches originally written by Torez Smith.

This patch defines context switch and trap related functionality
for BookE specific Debug Registers. It adds support to ptrace()
for setting and getting BookE related Debug Registers

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Torez Smith  <lnxtorez@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <dwg@au1.ibm.com>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Sergio Durigan Junior <sergiodj@br.ibm.com>
Cc: Thiago Jung Bauermann <bauerman@br.ibm.com>
Cc: linuxppc-dev list <Linuxppc-dev@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:17 +11:00
Dave Kleikamp
99396ac105 powerpc/booke: Add definitions for advanced debug registers
powerpc/booke: Add definitions for advanced debug registers

From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>

Based on patches originally written by Torez Smith.

This patch adds additional definitions for BookE Debug Registers
to the reg_booke.h header file.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Acked-by: David Gibson <dwg@au1.ibm.com>
Cc: Torez Smith  <lnxtorez@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Sergio Durigan Junior <sergiodj@br.ibm.com>
Cc: Thiago Jung Bauermann <bauerman@br.ibm.com>
Cc: linuxppc-dev list <Linuxppc-dev@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:17 +11:00
Dave Kleikamp
3162d92dfb powerpc: Extended ptrace interface
powerpc: Extended ptrace interface

From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>

Based on patches originally written by Torez Smith.

Add a new extended ptrace interface so that user-space has a single
interface for powerpc, without having to know the specific layout
of the debug registers.

Implement:
PPC_PTRACE_GETHWDEBUGINFO
PPC_PTRACE_SETHWDEBUG
PPC_PTRACE_DELHWDEBUG

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Acked-by: David Gibson <dwg@au1.ibm.com>
Cc: Torez Smith  <lnxtorez@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Sergio Durigan Junior <sergiodj@br.ibm.com>
Cc: Thiago Jung Bauermann <bauerman@br.ibm.com>
Cc: linuxppc-dev list <Linuxppc-dev@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:17 +11:00
Dave Kleikamp
172ae2e7f8 powerpc/booke: Introduce new CONFIG options for advanced debug registers
powerpc/booke: Introduce new CONFIG options for advanced debug registers

From: Dave Kleikamp <shaggy@linux.vnet.ibm.com>

Introduce new config options to simplify the ifdefs pertaining to the
advanced debug registers for booke and 40x processors:

CONFIG_PPC_ADV_DEBUG_REGS - boolean: true for dac-based processors
CONFIG_PPC_ADV_DEBUG_IACS - number of IAC registers
CONFIG_PPC_ADV_DEBUG_DACS - number of DAC registers
CONFIG_PPC_ADV_DEBUG_DVCS - number of DVC registers
CONFIG_PPC_ADV_DEBUG_DAC_RANGE - DAC ranges supported

Beginning conservatively, since I only have the facilities to test 440
hardware.  I believe all 40x and booke platforms support at least 2 IAC
and 2 DAC registers.  For 440, 4 IAC and 2 DVC registers are enabled, as
well as the DAC ranges.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Acked-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:16 +11:00
Anton Blanchard
789c299ca2 powerpc: Improve 64bit copy_tofrom_user
Here is a patch from Paul Mackerras that improves the ppc64 copy_tofrom_user.
The loop now does 32 bytes at a time and as well as pairing loads and stores.

A quick test case that reads 8kB over and over shows the improvement:

POWER6: 53% faster
POWER7: 51% faster

#define _XOPEN_SOURCE 500
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

#define BUFSIZE (8 * 1024)
#define ITERATIONS 10000000

int main()
{
	char tmpfile[] = "/tmp/copy_to_user_testXXXXXX";
	int fd;
	char *buf[BUFSIZE];
	unsigned long i;

	fd = mkstemp(tmpfile);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	if (write(fd, buf, BUFSIZE) != BUFSIZE) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < 10000000; i++) {
		if (pread(fd, buf, BUFSIZE, 0) != BUFSIZE) {
			perror("pread");
			exit(1);
		}
	}

	unlink(tmpfile);

	return 0;
}

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:16 +11:00
Anton Blanchard
63e6c5b810 powerpc: Pair loads and stores in copy_4k_page
A number of our chips like loads and stores to be paired. A small kernel
module testcase shows the improvement of pairing loads and stores in
copy_4k_page:

POWER6: +9%
POWER7: +1.5%

#include <linux/module.h>
#include <linux/mm.h>

#define ITERATIONS 10000000

static int __init copypage_init(void)
{
	struct timespec before, after;
	unsigned long i;
	struct page *destpage, *srcpage;
	char *dest, *src;

	destpage = alloc_page(GFP_KERNEL);
	srcpage = alloc_page(GFP_KERNEL);

	dest = page_address(destpage);
	src = page_address(srcpage);

	getnstimeofday(&before);

	for (i = 0; i < ITERATIONS; i++)
		copy_4K_page(dest, src);

	getnstimeofday(&after);

	free_page((unsigned long)dest);
	free_page((unsigned long)src);

	printk(KERN_DEBUG "copy_4K_page loop took %lu ns\n",
		(after.tv_sec - before.tv_sec) * NSEC_PER_SEC +
		(after.tv_nsec - before.tv_nsec));

	return 0;
}

static void __exit copypage_exit(void)
{
}

module_init(copypage_init)
module_exit(copypage_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:16 +11:00
Anton Blanchard
5a0e9b5718 powerpc: Use lwsync for acquire barrier if CPU supports it
Nick Piggin discovered that lwsync barriers around locks were faster than isync
on 970. That was a long time ago and I completely dropped the ball in testing
his patches across other ppc64 processors.

Turns out the idea helps on other chips. Using a microbenchmark that
uses a lot of threads to contend on a global pthread mutex (and therefore a
global futex), POWER6 improves 8% and POWER7 improves 2%. I checked POWER5
and while I couldn't measure an improvement, there was no regression.

This patch uses the lwsync patching code to replace the isyncs with lwsyncs
on CPUs that support the instruction. We were marking POWER3 and RS64 as lwsync
capable but in reality they treat it as a full sync (ie slow). Remove the
CPU_FTR_LWSYNC bit from these CPUs so they continue to use the faster isync
method.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:16 +11:00
Anton Blanchard
53eae2281a powerpc: Fix lwsync patching code on 64bit
do_lwsync_fixups doesn't work on 64bit, we end up writing lwsyncs to the
wrong addresses:

0:mon> di c0000001000bfacc
c0000001000bfacc  7c2004ac      lwsync

Since the lwsync section has negative offsets we need to use a signed int
pointer so we sign extend the value.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:15 +11:00
Anton Blanchard
f10e2e5b4b powerpc: Rename LWSYNC_ON_SMP to PPC_RELEASE_BARRIER, ISYNC_ON_SMP to PPC_ACQUIRE_BARRIER
For performance reasons we are about to change ISYNC_ON_SMP to sometimes be
lwsync. Now that the macro name doesn't make sense, change it and LWSYNC_ON_SMP
to better explain what the barriers are doing.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:15 +11:00
Anton Blanchard
66d99b8834 powerpc: Convert open coded native hashtable bit lock
Now we have real bit locks use them instead of open coding it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:15 +11:00
Anton Blanchard
864b9e6fd7 powerpc: Use lwarx/ldarx hint in bit locks
This patch implements the lwarx/ldarx hint bit for bit locks.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:15 +11:00
Anton Blanchard
4e14a4d17a powerpc: Use lwarx hint in spinlocks
Recent versions of the PowerPC architecture added a hint bit to the larx
instructions to differentiate between an atomic operation and a lock operation:

> 0 Other programs might attempt to modify the word in storage addressed by EA
> even if the subsequent Store Conditional succeeds.
>
> 1 Other programs will not attempt to modify the word in storage addressed by
> EA until the program that has acquired the lock performs a subsequent store
> releasing the lock.

To avoid a binutils dependency this patch create macros for the extended lwarx
format and uses it in the spinlock code. To test this change I used a simple
test case that acquires and releases a global pthread mutex:

	pthread_mutex_lock(&mutex);
	pthread_mutex_unlock(&mutex);

On a 32 core POWER6, running 32 test threads we spend almost all our time in
the futex spinlock code:

    94.37%     perf  [kernel]                     [k] ._raw_spin_lock
               |
               |--99.95%-- ._raw_spin_lock
               |          |
               |          |--63.29%-- .futex_wake
               |          |
               |          |--36.64%-- .futex_wait_setup

Which is a good test for this patch. The results (in lock/unlock operations per
second) are:

before: 1538203 ops/sec
after:  2189219 ops/sec

An improvement of 42%

A 32 core POWER7 improves even more:

before: 1279529 ops/sec
after:  2282076 ops/sec

An improvement of 78%

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:03:14 +11:00
Anton Blanchard
17081102a6 powerpc: Convert global "BAD" interrupt to per cpu spurious
I often get asked if BAD interrupts are really bad. On some boxes (eg
IBM machines running a hypervisor) there are valid cases where are
presented with an interrupt that is not for us. These cases are common
enough to show up as thousands of BAD interrupts a day.

Tone them down by calling them spurious. Since they can be a significant cause
of OS jitter, we may as well log them per cpu so we know where they are
occurring.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:02:49 +11:00
Anton Blanchard
89713ed108 powerpc: Add timer, performance monitor and machine check counts to /proc/interrupts
With NO_HZ it is useful to know how often the decrementer is going off. The
patch below adds an entry for it and also adds it into the /proc/stat
summaries.

While here, I added performance monitoring and machine check exceptions.
I found it useful to keep an eye on the PMU exception rate
when using the perf tool. Since it's possible to take a completely
handled machine check on a System p box it also sounds like a good idea to
keep a machine check summary.

The event naming matches x86 to keep gratuitous differences to a minimum.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:02:49 +11:00
Anton Blanchard
fc380c0c8a powerpc: Remove whitespace in irq chip name fields
Now we use printf style alignment there is no need to manually space
these fields.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:02:48 +11:00
Anton Blanchard
c86845ede8 powerpc: Rework /proc/interrupts
On a large machine I noticed the columns of /proc/interrupts failed to line up
with the header after CPU9. At sufficiently large numbers of CPUs it becomes
impossible to line up the CPU number with the counts.

While fixing this I noticed x86 has a number of updates that we may as well
pull in. On PowerPC we currently omit an interrupt completely if there is no
active handler, whereas on x86 it is printed if there is a non zero count.

The x86 code also spaces the first column correctly based on nr_irqs.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:02:48 +11:00
Anton Blanchard
fda9d86100 powerpc: Reduce footprint of xics_ipi_struct
Right now we allocate a cacheline sized NR_CPUS array for xics IPI
communication. Use DECLARE_PER_CPU_SHARED_ALIGNED to put it in percpu
data in its own cacheline since it is written to by other cpus.

On a kernel with NR_CPUS=1024, this saves quite a lot of memory:

   text    data     bss      dec         hex    filename
8767779 2944260 1505724 13217763         c9afe3 vmlinux.irq_cpustat
8767555 2813444 1505724 13086723         c7b003 vmlinux.xics

A saving of around 128kB.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:02:48 +11:00
Anton Blanchard
8c007bfdf1 powerpc: Reduce footprint of irq_stat
PowerPC is currently using asm-generic/hardirq.h which statically allocates an
NR_CPUS irq_stat array. Switch to an arch specific implementation which uses
per cpu data:

On a kernel with NR_CPUS=1024, this saves quite a lot of memory:

   text    data     bss      dec         hex    filename
8767938 2944132 1636796 13348866         cbb002 vmlinux.baseline
8767779 2944260 1505724 13217763         c9afe3 vmlinux.irq_cpustat

A saving of around 128kB.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:02:48 +11:00
Breno Leitao
8d3d50bf19 powerpc/eeh: Fix a bug when pci structure is null
During a EEH recover, the pci_dev structure can be null, mainly if an
eeh event is detected during cpi config operation. In this case, the
pci_dev will not be known (and will be null) the kernel will crash
with the following message:

Unable to handle kernel paging request for data at address 0x000000a0
Faulting instruction address: 0xc00000000006b8b4
Oops: Kernel access of bad area, sig: 11 [#1]

NIP [c00000000006b8b4] .eeh_event_handler+0x10c/0x1a0
LR [c00000000006b8a8] .eeh_event_handler+0x100/0x1a0
Call Trace:
[c0000003a80dff00] [c00000000006b8a8] .eeh_event_handler+0x100/0x1a0
[c0000003a80dff90] [c000000000031f1c] .kernel_thread+0x54/0x70

The bug occurs because pci_name() tries to access a null pointer.
This patch just guarantee that pci_name() is not called on Null pointers.

Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>
Signed-off-by: Linas Vepstas <linasvepstas@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:02:47 +11:00
Corey Minyard
e0508b1516 powerpc: Add coherent_dma_mask to mv64x60 devices
DMA ops requires that coherent_dma_mask be set properly for a device,
but this was not being done for devices on the MV64x60 that use DMA.
Both the serial and ethernet devices need this or they won't be able
to allocate memory.

Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:02:47 +11:00
Benjamin Herrenschmidt
ec144a81ad Merge commit 'origin/master' into next 2010-02-17 10:00:42 +11:00
Linus Torvalds
8862627254 Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
  dm: sysfs revert add empty release function to avoid debug warning
  dm mpath: fix stall when requeueing io
  dm raid1: fix null pointer dereference in suspend
  dm raid1: fail writes if errors are not handled and log fails
  dm log: userspace fix overhead_size calcuations
  dm snapshot: persistent annotate work_queue as on stack
  dm stripe: avoid divide by zero with invalid stripe count
2010-02-16 12:22:15 -08:00
Linus Torvalds
5ae1d95568 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] preserve personality flag bits across exec
2010-02-16 11:59:01 -08:00
Alasdair G Kergon
9307f6b19a dm: sysfs revert add empty release function to avoid debug warning
Revert commit d2bb7df8ca at Greg's request.

    Author: Milan Broz <mbroz@redhat.com>
    Date:   Thu Dec 10 23:51:53 2009 +0000

    dm: sysfs add empty release function to avoid debug warning

    This patch just removes an unnecessary warning:
     kobject: 'dm': does not have a release() function,
     it is broken and must be fixed.

    The kobject is embedded in mapped device struct, so
    code does not need to release memory explicitly here.

Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-02-16 18:43:04 +00:00
Kiyoshi Ueda
9eef87da2a dm mpath: fix stall when requeueing io
This patch fixes the problem that system may stall if target's ->map_rq
returns DM_MAPIO_REQUEUE in map_request().
E.g. stall happens on 1 CPU box when a dm-mpath device with queue_if_no_path
     bounces between all-paths-down and paths-up on I/O load.

When target's ->map_rq returns DM_MAPIO_REQUEUE, map_request() requeues
the request and returns to dm_request_fn().  Then, dm_request_fn()
doesn't exit the I/O dispatching loop and continues processing
the requeued request again.
This map and requeue loop can be done with interrupt disabled,
so 1 CPU system can be stalled if this situation happens.

For example, commands below can stall my 1 CPU box within 1 minute or so:
  # dmsetup table mp
  mp: 0 2097152 multipath 1 queue_if_no_path 0 1 1 service-time 0 1 2 8:144 1 1
  # while true; do dd if=/dev/mapper/mp of=/dev/null bs=1M count=100; done &
  # while true; do \
  > dmsetup message mp 0 "fail_path 8:144" \
  > dmsetup suspend --noflush mp \
  > dmsetup resume mp \
  > dmsetup message mp 0 "reinstate_path 8:144" \
  > done

To fix the problem above, this patch changes dm_request_fn() to exit
the I/O dispatching loop once if a request is requeued in map_request().

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-02-16 18:43:01 +00:00
Takahiro Yasui
558569aa9d dm raid1: fix null pointer dereference in suspend
When suspending a failed mirror, bios are completed by mirror_end_io() and
__rh_lookup() in dm_rh_dec() returns NULL where a non-NULL return value is
required by design.  Fix this by not changing the state of the recovery failed
region from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end().

Issue

On 2.6.33-rc1 kernel, I hit the bug when I suspended the failed
mirror by dmsetup command.

BUG: unable to handle kernel NULL pointer dereference at 00000020
IP: [<f94f38e2>] dm_rh_dec+0x35/0xa1 [dm_region_hash]
...
EIP: 0060:[<f94f38e2>] EFLAGS: 00010046 CPU: 0
EIP is at dm_rh_dec+0x35/0xa1 [dm_region_hash]
EAX: 00000286 EBX: 00000000 ECX: 00000286 EDX: 00000000
ESI: eff79eac EDI: eff79e80 EBP: f6915cd4 ESP: f6915cc4
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process dmsetup (pid: 2849, ti=f6914000 task=eff03e80 task.ti=f6914000)
 ...
Call Trace:
 [<f9530af6>] ? mirror_end_io+0x53/0x1b1 [dm_mirror]
 [<f9413104>] ? clone_endio+0x4d/0xa2 [dm_mod]
 [<f9530aa3>] ? mirror_end_io+0x0/0x1b1 [dm_mirror]
 [<f94130b7>] ? clone_endio+0x0/0xa2 [dm_mod]
 [<c02d6bcb>] ? bio_endio+0x28/0x2b
 [<f952f303>] ? hold_bio+0x2d/0x62 [dm_mirror]
 [<f952f942>] ? mirror_presuspend+0xeb/0xf7 [dm_mirror]
 [<c02aa3e2>] ? vmap_page_range+0xb/0xd
 [<f9414c8d>] ? suspend_targets+0x2d/0x3b [dm_mod]
 [<f9414ca9>] ? dm_table_presuspend_targets+0xe/0x10 [dm_mod]
 [<f941456f>] ? dm_suspend+0x4d/0x150 [dm_mod]
 [<f941767d>] ? dev_suspend+0x55/0x18a [dm_mod]
 [<c0343762>] ? _copy_from_user+0x42/0x56
 [<f9417fb0>] ? dm_ctl_ioctl+0x22c/0x281 [dm_mod]
 [<f9417628>] ? dev_suspend+0x0/0x18a [dm_mod]
 [<f9417d84>] ? dm_ctl_ioctl+0x0/0x281 [dm_mod]
 [<c02c3c4b>] ? vfs_ioctl+0x22/0x85
 [<c02c422c>] ? do_vfs_ioctl+0x4cb/0x516
 [<c02c42b7>] ? sys_ioctl+0x40/0x5a
 [<c0202858>] ? sysenter_do_call+0x12/0x28

Analysis

When recovery process of a region failed, dm_rh_recovery_end() function
changes the state of the region from RM_RH_RECOVERING to DM_RH_NOSYNC.
When recovery_complete() is executed between dm_rh_update_states() and
dm_writes() in do_mirror(), bios are processed with the region state,
DM_RH_NOSYNC. However, the region data is freed without checking its
pending count when dm_rh_update_states() is called next time.

When bios are finished by mirror_end_io(), __rh_lookup() in dm_rh_dec()
returns NULL even though a valid return value are expected.

Solution

Remove the state change of the recovery failed region from DM_RH_RECOVERING
to DM_RH_NOSYNC in dm_rh_recovery_end(). We can remove the state change
because:

  - If the region data has been released by dm_rh_update_states(),
    a new region data is created with the state of DM_RH_NOSYNC, and
    bios are processed according to the DM_RH_NOSYNC state.

  - If the region data has not been released by dm_rh_update_states(),
    a state of the region is DM_RH_RECOVERING and bios are put in the
    delayed_bio list.

The flag change from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end()
was added in the following commit:
  dm raid1: handle resync failures
  author  Jonathan Brassow <jbrassow@redhat.com>
    Thu, 12 Jul 2007 16:29:04 +0000 (17:29 +0100)
  http://git.kernel.org/linus/f44db678edcc6f4c2779ac43f63f0b9dfa28b724

Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-02-16 18:42:58 +00:00
Mikulas Patocka
5528d17de1 dm raid1: fail writes if errors are not handled and log fails
If the mirror log fails when the handle_errors option was not selected
and there is no remaining valid mirror leg, writes return success even
though they weren't actually written to any device.  This patch
completes them with EIO instead.

This code path is taken:
do_writes:
	bio_list_merge(&ms->failures, &sync);
do_failures:
	if (!get_valid_mirror(ms)) (false)
	else if (errors_handled(ms)) (false)
	else bio_endio(bio, 0);

The logic in do_failures is based on presuming that the write was already
tried: if it succeeded at least on one leg (without handle_errors) it
is reported as success.

Reference: https://bugzilla.redhat.com/show_bug.cgi?id=555197

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-02-16 18:42:55 +00:00
Jonathan Brassow
ebfd32bba9 dm log: userspace fix overhead_size calcuations
This patch fixes two bugs that revolve around the miscalculation and
misuse of the variable 'overhead_size'.  'overhead_size' is the size of
the various header structures used during communication.

The first bug is the use of 'sizeof' with the pointer of a structure
instead of the structure itself - resulting in the wrong size being
computed.  This is then used in a check to see if the payload
(data_size) would be to large for the preallocated structure.  Since the
bug produces a smaller value for the overhead, it was possible for the
structure to be breached.  (Although the current users of the code do
not currently send enough data to trigger this bug.)

The second bug is that the 'overhead_size' value is used to compute how
much of the preallocated space should be cleared before populating it
with fresh data.  This should have simply been 'sizeof(struct cn_msg)'
not overhead_size.  The fact that 'overhead_size' was computed
incorrectly made this problem "less bad" - leaving only a pointer's
worth of space at the end uncleared.  Thus, this bug was never producing
a bad result, but still needs to be fixed - especially now that the
value is computed correctly.

Cc: stable@kernel.org
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-02-16 18:42:53 +00:00
Mike Snitzer
55f67f2ded dm snapshot: persistent annotate work_queue as on stack
chunk_io() declares its 'struct mdata_req' on the stack and then
initializes its 'struct work_struct' member.  Annotate the
initialization of this workqueue with INIT_WORK_ON_STACK to suppress a
debugobjects warning seen when CONFIG_DEBUG_OBJECTS_WORK is enabled.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-02-16 18:42:51 +00:00
Nikanth Karthikesan
781248c1b5 dm stripe: avoid divide by zero with invalid stripe count
If a table containing zero as stripe count is passed into stripe_ctr
the code attempts to divide by zero.

This patch changes DM_TABLE_LOAD to return -EINVAL if the stripe count
is zero.

We now get the following error messages:
  device-mapper: table: 253:0: striped: Invalid stripe count
  device-mapper: ioctl: error adding target to table

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-02-16 18:42:47 +00:00
Oleg Nesterov
11557b24fd x86: ELF_PLAT_INIT() shouldn't worry about TIF_IA32
The 64-bit version of ELF_PLAT_INIT() clears TIF_IA32, but at this point
it has already been cleared by SET_PERSONALITY == set_personality_64bit.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-16 08:51:49 -08:00
Oleg Nesterov
1252f238db x86: set_personality_ia32() misses force_personality32
05d43ed8a "x86: get rid of the insane TIF_ABI_PENDING bit" forgot about
force_personality32.  Fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-16 08:50:28 -08:00
Linus Torvalds
0813e22d4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: btrfs_mark_extent_written uses the wrong slot
2010-02-15 19:56:21 -08:00
Linus Torvalds
382640b337 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
  firewire: ohci: retransmit isochronous transmit packets on cycle loss
  firewire: net: fix panic in fwnet_write_complete
2010-02-15 19:54:54 -08:00
Linus Torvalds
d277993f78 Merge branch 'fix/hda' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'fix/hda' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: hda - Correct ASUA blacklist for MSI brokenness
2010-02-15 19:54:18 -08:00
Chuck Lever
65d269538a NFS: Too many GETATTR and ACCESS calls after direct I/O
The cached read and write paths initialize fattr->time_start in their
setup procedures.  The value of fattr->time_start is propagated to
read_cache_jiffies by nfs_update_inode().  Subsequent calls to
nfs_attribute_timeout() will then use a good time stamp when
computing the attribute cache timeout, and squelch unneeded GETATTR
calls.

Since the direct I/O paths erroneously leave the inode's
fattr->time_start field set to zero, read_cache_jiffies for that inode
is set to zero after any direct read or write operation.  This
triggers an otw GETATTR or ACCESS call to update the file's attribute
and access caches properly, even when the NFS READ or WRITE replies
have usable post-op attributes.

Make sure the direct read and write setup code performs the same fattr
initialization as the cached I/O paths to prevent unnecessary GETATTR
calls.

This was likely introduced by commit 0e574af1 in 2.6.15, which appears
to add new nfs_fattr_init() call sites in the cached read and write
paths, but not in the equivalent places in fs/nfs/direct.c.  A
subsequent commit in the same series, 33801147, introduces the
fattr->time_start field.

Interestingly, the direct write reschedule path already has a call to
nfs_fattr_init() in the right place.

Reported-by: Quentin Barnes <qbarnes@yahoo-inc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-15 19:53:43 -08:00
Linus Torvalds
7d0bab9dfe Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  hrtimer, softirq: Fix hrtimer->softirq trampoline
2010-02-15 19:52:12 -08:00
Linus Torvalds
0aa2ca9ae1 Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  reiserfs: Fix softlockup while waiting on an inode
2010-02-15 19:51:45 -08:00
Linus Torvalds
76212a840f Merge branch 'drm-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
  drm/radeon/kms: make sure retry count increases.
  drm/radeon/kms/atom: use get_unaligned_le32() for ctx->ps
  drm/ttm: Fix a bug occuring when validating a buffer object in a range.
  drm: Fix a bug in the range manager.
2010-02-15 19:51:15 -08:00
Linus Torvalds
e04984c839 Merge branch 'sh/for-2.6.33' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* 'sh/for-2.6.33' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
  sh64: fix tracing of signals.
2010-02-15 19:50:34 -08:00