// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved.
 * Copyright (C) 2004-2008 Red Hat, Inc. All rights reserved.
 */

#include <linux/spinlock.h>
#include <linux/completion.h>
#include <linux/buffer_head.h>
#include <linux/gfs2_ondisk.h>
#include <linux/bio.h>
#include <linux/posix_acl.h>
#include <linux/security.h>

#include "gfs2.h"
#include "incore.h"
#include "bmap.h"
#include "glock.h"
#include "glops.h"
#include "inode.h"
#include "log.h"
#include "meta_io.h"
#include "recovery.h"
#include "rgrp.h"
#include "util.h"
#include "trans.h"
#include "dir.h"
#include "lops.h"

struct workqueue_struct *gfs2_freeze_wq;

extern struct workqueue_struct *gfs2_control_wq;

static void gfs2_ail_error(struct gfs2_glock *gl, const struct buffer_head *bh)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;

	fs_err(sdp,
	       "AIL buffer %p: blocknr %llu state 0x%08lx mapping %p page "
	       "state 0x%lx\n",
	       bh, (unsigned long long)bh->b_blocknr, bh->b_state,
	       bh->b_folio->mapping, bh->b_folio->flags);
	fs_err(sdp, "AIL glock %u:%llu mapping %p\n",
	       gl->gl_name.ln_type, gl->gl_name.ln_number,
	       gfs2_glock2aspace(gl));
	gfs2_lm(sdp, "AIL error\n");
	gfs2_withdraw_delayed(sdp);
}

/**
 * __gfs2_ail_flush - remove all buffers for a given lock from the AIL
 * @gl: the glock
 * @fsync: set when called from fsync (not all buffers will be clean)
 * @nr_revokes: Number of buffers to revoke
 *
 * None of the buffers should be dirty, locked, or pinned.
 */

static void __gfs2_ail_flush(struct gfs2_glock *gl, bool fsync,
			     unsigned int nr_revokes)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct list_head *head = &gl->gl_ail_list;
	struct gfs2_bufdata *bd, *tmp;
	struct buffer_head *bh;
	const unsigned long b_state = (1UL << BH_Dirty)|(1UL << BH_Pinned)|(1UL << BH_Lock);

	gfs2_log_lock(sdp);
	spin_lock(&sdp->sd_ail_lock);
	list_for_each_entry_safe_reverse(bd, tmp, head, bd_ail_gl_list) {
		if (nr_revokes == 0)
			break;
		bh = bd->bd_bh;
		if (bh->b_state & b_state) {
			if (fsync)
				continue;
			gfs2_ail_error(gl, bh);
		}
		gfs2_trans_add_revoke(sdp, bd);
		nr_revokes--;
	}
	GLOCK_BUG_ON(gl, !fsync && atomic_read(&gl->gl_ail_count));
	spin_unlock(&sdp->sd_ail_lock);
	gfs2_log_unlock(sdp);

	if (gfs2_withdrawing(sdp))
		gfs2_withdraw(sdp);
}


static int gfs2_ail_empty_gl(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct gfs2_trans tr;
	unsigned int revokes;
	int ret = 0;

	revokes = atomic_read(&gl->gl_ail_count);

	if (!revokes) {
		bool have_revokes;
		bool log_in_flight;

		/*
		 * We have nothing on the ail, but there could be revokes on
		 * the sdp revoke queue, in which case, we still want to flush
		 * the log and wait for it to finish.
		 *
		 * If the sdp revoke list is empty too, we might still have an
		 * io outstanding for writing revokes, so we should wait for
		 * it before returning.
		 *
		 * If none of these conditions are true, our revokes are all
		 * flushed and we can return.
		 */
		gfs2_log_lock(sdp);
		have_revokes = !list_empty(&sdp->sd_log_revokes);
		log_in_flight = atomic_read(&sdp->sd_log_in_flight);
		gfs2_log_unlock(sdp);
		if (have_revokes)
			goto flush;
		if (log_in_flight)
			log_flush_wait(sdp);
		return 0;
	}

	memset(&tr, 0, sizeof(tr));
	set_bit(TR_ONSTACK, &tr.tr_flags);
	ret = __gfs2_trans_begin(&tr, sdp, 0, revokes, _RET_IP_);
	if (ret) {
		fs_err(sdp, "Transaction error %d: Unable to write revokes.", ret);
		goto flush;
	}
	__gfs2_ail_flush(gl, 0, revokes);
	gfs2_trans_end(sdp);

flush:
	if (!ret)
		gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
				GFS2_LFC_AIL_EMPTY_GL);
	return ret;
}

void gfs2_ail_flush(struct gfs2_glock *gl, bool fsync)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	unsigned int revokes = atomic_read(&gl->gl_ail_count);
	int ret;

	if (!revokes)
		return;

	ret = gfs2_trans_begin(sdp, 0, revokes);
	if (ret)
		return;
	__gfs2_ail_flush(gl, fsync, revokes);
	gfs2_trans_end(sdp);
	gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_AIL_FLUSH);
}

/**
 * gfs2_rgrp_metasync - sync out the metadata of a resource group
 * @gl: the glock protecting the resource group
 *
 */

static int gfs2_rgrp_metasync(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct address_space *metamapping = &sdp->sd_aspace;
	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
	const unsigned bsize = sdp->sd_sb.sb_bsize;
	loff_t start = (rgd->rd_addr * bsize) & PAGE_MASK;
	loff_t end = PAGE_ALIGN((rgd->rd_addr + rgd->rd_length) * bsize) - 1;
	int error;

	filemap_fdatawrite_range(metamapping, start, end);
	error = filemap_fdatawait_range(metamapping, start, end);
	WARN_ON_ONCE(error && !gfs2_withdrawing_or_withdrawn(sdp));
	mapping_set_error(metamapping, error);
	if (error)
		gfs2_io_error(sdp);
	return error;
}

/**
 * rgrp_go_sync - sync out the metadata for this glock
 * @gl: the glock
 *
 * Called when demoting or unlocking an EX glock. We must flush
 * to disk all dirty buffers/pages relating to this glock, and must not
 * return to caller to demote/unlock the glock until I/O is complete.
 */

static int rgrp_go_sync(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
	int error;

	if (!rgd || !test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
		return 0;
	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);

	gfs2_log_flush(sdp, gl, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_RGRP_GO_SYNC);
	error = gfs2_rgrp_metasync(gl);
	if (!error)
		error = gfs2_ail_empty_gl(gl);
	gfs2_free_clones(rgd);
	return error;
}

/**
 * rgrp_go_inval - invalidate the metadata for this glock
 * @gl: the glock
 * @flags:
 *
 * We never use LM_ST_DEFERRED with resource groups, so we
 * should always see the metadata flag set here.
 *
 */

static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct address_space *mapping = &sdp->sd_aspace;
	struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
	const unsigned bsize = sdp->sd_sb.sb_bsize;
	loff_t start, end;

	if (!rgd)
		return;
	start = (rgd->rd_addr * bsize) & PAGE_MASK;
	end = PAGE_ALIGN((rgd->rd_addr + rgd->rd_length) * bsize) - 1;
	gfs2_rgrp_brelse(rgd);
	WARN_ON_ONCE(!(flags & DIO_METADATA));
	truncate_inode_pages_range(mapping, start, end);
}

static void gfs2_rgrp_go_dump(struct seq_file *seq, const struct gfs2_glock *gl,
			      const char *fs_id_buf)
{
	struct gfs2_rgrpd *rgd = gl->gl_object;

	if (rgd)
		gfs2_rgrp_dump(seq, rgd, fs_id_buf);
}

static struct gfs2_inode *gfs2_glock2inode(struct gfs2_glock *gl)
{
	struct gfs2_inode *ip;

	spin_lock(&gl->gl_lockref.lock);
	ip = gl->gl_object;
	if (ip)
		set_bit(GIF_GLOP_PENDING, &ip->i_flags);
	spin_unlock(&gl->gl_lockref.lock);
	return ip;
}

struct gfs2_rgrpd *gfs2_glock2rgrp(struct gfs2_glock *gl)
{
	struct gfs2_rgrpd *rgd;

	spin_lock(&gl->gl_lockref.lock);
	rgd = gl->gl_object;
	spin_unlock(&gl->gl_lockref.lock);

	return rgd;
}

static void gfs2_clear_glop_pending(struct gfs2_inode *ip)
{
	if (!ip)
		return;

	clear_bit_unlock(GIF_GLOP_PENDING, &ip->i_flags);
	wake_up_bit(&ip->i_flags, GIF_GLOP_PENDING);
}

/**
 * gfs2_inode_metasync - sync out the metadata of an inode
 * @gl: the glock protecting the inode
 *
 */
int gfs2_inode_metasync(struct gfs2_glock *gl)
{
	struct address_space *metamapping = gfs2_glock2aspace(gl);
	int error;

	filemap_fdatawrite(metamapping);
	error = filemap_fdatawait(metamapping);
	if (error)
		gfs2_io_error(gl->gl_name.ln_sbd);
	return error;
}

/**
 * inode_go_sync - Sync the dirty metadata of an inode
 * @gl: the glock protecting the inode
 *
 */

static int inode_go_sync(struct gfs2_glock *gl)
{
	struct gfs2_inode *ip = gfs2_glock2inode(gl);
	int isreg = ip && S_ISREG(ip->i_inode.i_mode);
	struct address_space *metamapping = gfs2_glock2aspace(gl);
	int error = 0, ret;

	if (isreg) {
		if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags))
			unmap_shared_mapping_range(ip->i_inode.i_mapping, 0, 0);
		inode_dio_wait(&ip->i_inode);
	}
	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
		goto out;

	GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);

	gfs2_log_flush(gl->gl_name.ln_sbd, gl, GFS2_LOG_HEAD_FLUSH_NORMAL |
		       GFS2_LFC_INODE_GO_SYNC);
	filemap_fdatawrite(metamapping);
	if (isreg) {
		struct address_space *mapping = ip->i_inode.i_mapping;
		filemap_fdatawrite(mapping);
		error = filemap_fdatawait(mapping);
		mapping_set_error(mapping, error);
	}
	ret = gfs2_inode_metasync(gl);
	if (!error)
		error = ret;
	ret = gfs2_ail_empty_gl(gl);
	if (!error)
		error = ret;
	/*
	 * Writeback of the data mapping may cause the dirty flag to be set
	 * so we have to clear it again here.
	 */
	smp_mb__before_atomic();
	clear_bit(GLF_DIRTY, &gl->gl_flags);

out:
	gfs2_clear_glop_pending(ip);
	return error;
}

/**
 * inode_go_inval - prepare an inode glock to be released
 * @gl: the glock
 * @flags:
 *
 * Normally we invalidate everything, but if we are moving into
 * LM_ST_DEFERRED from LM_ST_SHARED or LM_ST_EXCLUSIVE then we
 * can keep hold of the metadata, since it won't have changed.
 *
 */

static void inode_go_inval(struct gfs2_glock *gl, int flags)
{
	struct gfs2_inode *ip = gfs2_glock2inode(gl);

	if (flags & DIO_METADATA) {
		struct address_space *mapping = gfs2_glock2aspace(gl);
		truncate_inode_pages(mapping, 0);
		if (ip) {
			set_bit(GLF_INSTANTIATE_NEEDED, &gl->gl_flags);
			forget_all_cached_acls(&ip->i_inode);
			security_inode_invalidate_secctx(&ip->i_inode);
			gfs2_dir_hash_inval(ip);
		}
	}

	if (ip == GFS2_I(gl->gl_name.ln_sbd->sd_rindex)) {
		gfs2_log_flush(gl->gl_name.ln_sbd, NULL,
			       GFS2_LOG_HEAD_FLUSH_NORMAL |
			       GFS2_LFC_INODE_GO_INVAL);
		gl->gl_name.ln_sbd->sd_rindex_uptodate = 0;
	}
	if (ip && S_ISREG(ip->i_inode.i_mode))
		truncate_inode_pages(ip->i_inode.i_mapping, 0);

	gfs2_clear_glop_pending(ip);
}

/**
 * inode_go_demote_ok - Check to see if it's ok to unlock an inode glock
 * @gl: the glock
 *
 * Returns: 1 if it's ok
 */

static int inode_go_demote_ok(const struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;

	if (sdp->sd_jindex == gl->gl_object || sdp->sd_rindex == gl->gl_object)
		return 0;

	return 1;
}

2011-05-09 20:49:59 +08:00
|
|
|
static int gfs2_dinode_in(struct gfs2_inode *ip, const void *buf)
|
|
|
|
{
|
2023-03-28 06:43:16 +08:00
|
|
|
struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
|
2011-05-09 20:49:59 +08:00
|
|
|
const struct gfs2_dinode *str = buf;
|
2023-10-05 02:52:25 +08:00
|
|
|
struct timespec64 atime, iatime;
|
2011-05-09 20:49:59 +08:00
|
|
|
u16 height, depth;
|
2021-02-13 02:22:38 +08:00
|
|
|
umode_t mode = be32_to_cpu(str->di_mode);
|
2022-12-04 23:50:41 +08:00
|
|
|
struct inode *inode = &ip->i_inode;
|
|
|
|
bool is_new = inode->i_state & I_NEW;
|
2011-05-09 20:49:59 +08:00
|
|
|
|
2024-01-12 01:42:58 +08:00
|
|
|
if (unlikely(ip->i_no_addr != be64_to_cpu(str->di_num.no_addr))) {
|
|
|
|
gfs2_consist_inode(ip);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
if (unlikely(!is_new && inode_wrong_type(inode, mode))) {
|
|
|
|
gfs2_consist_inode(ip);
|
|
|
|
return -EIO;
|
|
|
|
}
|
2011-05-09 20:49:59 +08:00
|
|
|
ip->i_no_formal_ino = be64_to_cpu(str->di_num.no_formal_ino);
|
2022-12-04 23:50:41 +08:00
|
|
|
inode->i_mode = mode;
|
2021-02-13 02:22:38 +08:00
|
|
|
if (is_new) {
|
2022-12-04 23:50:41 +08:00
|
|
|
inode->i_rdev = 0;
|
2021-02-13 02:22:38 +08:00
|
|
|
switch (mode & S_IFMT) {
|
|
|
|
case S_IFBLK:
|
|
|
|
case S_IFCHR:
|
2022-12-04 23:50:41 +08:00
|
|
|
inode->i_rdev = MKDEV(be32_to_cpu(str->di_major),
|
|
|
|
be32_to_cpu(str->di_minor));
|
2021-02-13 02:22:38 +08:00
|
|
|
break;
|
|
|
|
}
|
2019-10-04 23:55:29 +08:00
|
|
|
}
|
2011-05-09 20:49:59 +08:00
|
|
|
|
2022-12-04 23:50:41 +08:00
|
|
|
i_uid_write(inode, be32_to_cpu(str->di_uid));
|
|
|
|
i_gid_write(inode, be32_to_cpu(str->di_gid));
|
|
|
|
set_nlink(inode, be32_to_cpu(str->di_nlink));
|
|
|
|
i_size_write(inode, be64_to_cpu(str->di_size));
|
|
|
|
gfs2_set_inode_blocks(inode, be64_to_cpu(str->di_blocks));
|
2011-05-09 20:49:59 +08:00
|
|
|
atime.tv_sec = be64_to_cpu(str->di_atime);
|
|
|
|
atime.tv_nsec = be32_to_cpu(str->di_atime_nsec);
|
2023-10-05 02:52:25 +08:00
|
|
|
iatime = inode_get_atime(inode);
|
|
|
|
if (timespec64_compare(&iatime, &atime) < 0)
|
|
|
|
inode_set_atime_to_ts(inode, atime);
|
|
|
|
inode_set_mtime(inode, be64_to_cpu(str->di_mtime),
|
|
|
|
be32_to_cpu(str->di_mtime_nsec));
|
2023-07-06 03:01:12 +08:00
|
|
|
inode_set_ctime(inode, be64_to_cpu(str->di_ctime),
|
|
|
|
be32_to_cpu(str->di_ctime_nsec));
|
2011-05-09 20:49:59 +08:00
|
|
|
|
|
|
|
ip->i_goal = be64_to_cpu(str->di_goal_meta);
|
|
|
|
ip->i_generation = be64_to_cpu(str->di_generation);
|
|
|
|
|
|
|
|
ip->i_diskflags = be32_to_cpu(str->di_flags);
|
2011-06-16 21:06:55 +08:00
|
|
|
ip->i_eattr = be64_to_cpu(str->di_eattr);
|
|
|
|
/* i_diskflags and i_eattr must be set before gfs2_set_inode_flags() */
|
2022-12-04 23:50:41 +08:00
|
|
|
gfs2_set_inode_flags(inode);
|
2011-05-09 20:49:59 +08:00
|
|
|
height = be16_to_cpu(str->di_height);
|
2024-01-12 01:42:58 +08:00
|
|
|
if (unlikely(height > sdp->sd_max_height)) {
|
|
|
|
gfs2_consist_inode(ip);
|
|
|
|
return -EIO;
|
|
|
|
}
|
2011-05-09 20:49:59 +08:00
|
|
|
ip->i_height = (u8)height;
|
|
|
|
|
|
|
|
depth = be16_to_cpu(str->di_depth);
|
2024-01-12 01:42:58 +08:00
|
|
|
if (unlikely(depth > GFS2_DIR_MAX_DEPTH)) {
|
|
|
|
gfs2_consist_inode(ip);
|
|
|
|
return -EIO;
|
|
|
|
}
|
2011-05-09 20:49:59 +08:00
|
|
|
ip->i_depth = (u8)depth;
|
|
|
|
ip->i_entries = be32_to_cpu(str->di_entries);
|
|
|
|
|
2024-01-12 01:42:58 +08:00
|
|
|
if (gfs2_is_stuffed(ip) && inode->i_size > gfs2_max_stuffed_size(ip)) {
|
|
|
|
gfs2_consist_inode(ip);
|
|
|
|
return -EIO;
|
|
|
|
}
|
2022-12-04 23:50:41 +08:00
|
|
|
if (S_ISREG(inode->i_mode))
|
|
|
|
gfs2_set_aops(inode);
|
2011-05-09 20:49:59 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
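The decode-then-validate pattern used by gfs2_dinode_in() above can be tried out in user space. The following is a minimal sketch, not the GFS2 implementation: the 12-byte record layout, every `sk_` name, and the limits `SK_MAX_HEIGHT` and `SK_MAX_DEPTH` are assumptions made up for illustration. It reads big-endian fields byte by byte, range-checks them while they are still 16 bits wide, and only then narrows them to `uint8_t`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk record: be16 height, be16 depth, be64 size. */
enum { SK_OFF_HEIGHT = 0, SK_OFF_DEPTH = 2, SK_OFF_SIZE = 4, SK_REC_LEN = 12 };
/* Illustrative limits only; not taken from any GFS2 header. */
enum { SK_MAX_HEIGHT = 10, SK_MAX_DEPTH = 17 };

struct sk_inode {
	uint8_t  height;
	uint8_t  depth;
	uint64_t size;
};

/* Read a big-endian 16-bit value regardless of host byte order. */
static uint16_t sk_be16(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Read a big-endian 64-bit value regardless of host byte order. */
static uint64_t sk_be64(const unsigned char *p)
{
	uint64_t v = 0;
	size_t i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Decode buf (SK_REC_LEN bytes) into *ip; returns 0 or -EIO. */
int sk_inode_in(struct sk_inode *ip, const unsigned char *buf)
{
	uint16_t height, depth;

	assert(ip != NULL && buf != NULL);

	/* Check the full 16-bit values before narrowing to 8 bits. */
	height = sk_be16(buf + SK_OFF_HEIGHT);
	if (height > SK_MAX_HEIGHT)
		return -EIO;
	depth = sk_be16(buf + SK_OFF_DEPTH);
	if (depth > SK_MAX_DEPTH)
		return -EIO;

	ip->height = (uint8_t)height;
	ip->depth = (uint8_t)depth;
	ip->size = sk_be64(buf + SK_OFF_SIZE);
	return 0;
}
```

Unlike gfs2_dinode_in(), which has already written several fields by the time it reaches the height check, this sketch validates everything before touching `*ip`, so a rejected record leaves the destination unchanged.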
|
|
|
|
|
|
|
|
/**
|
|
|
|
* gfs2_inode_refresh - Refresh the incore copy of the dinode
|
|
|
|
* @ip: The GFS2 inode
|
|
|
|
*
|
|
|
|
* Returns: errno
|
|
|
|
*/
|
|
|
|
|
|
|
|
int gfs2_inode_refresh(struct gfs2_inode *ip)
|
|
|
|
{
|
|
|
|
struct buffer_head *dibh;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
error = gfs2_meta_inode_buffer(ip, &dibh);
|
|
|
|
if (error)
|
|
|
|
return error;
|
|
|
|
|
|
|
|
error = gfs2_dinode_in(ip, dibh->b_data);
|
|
|
|
brelse(dibh);
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2006-01-17 00:50:04 +08:00
|
|
|
/**
|
2021-09-30 04:06:21 +08:00
|
|
|
* inode_go_instantiate - read in an inode if necessary
|
2023-11-13 23:54:59 +08:00
|
|
|
* @gl: The glock
|
2006-01-17 00:50:04 +08:00
|
|
|
*
|
|
|
|
* Returns: errno
|
|
|
|
*/
|
|
|
|
|
2022-06-10 18:06:06 +08:00
|
|
|
static int inode_go_instantiate(struct gfs2_glock *gl)
|
2006-01-17 00:50:04 +08:00
|
|
|
{
|
2006-02-28 06:23:27 +08:00
|
|
|
struct gfs2_inode *ip = gl->gl_object;
|
2006-01-17 00:50:04 +08:00
|
|
|
|
gfs2: fix GL_SKIP node_scope problems
Before this patch, when a glock was locked, the very first holder on the
queue would unlock the lockref and call the go_instantiate glops function
(if one existed), unless GL_SKIP was specified. When we introduced the new
node-scope concept, we allowed multiple holders to lock glocks in EX mode
and share the lock.
But node-scope introduced a new problem: if the first holder has GL_SKIP
and the next one does NOT, since it is not the first holder on the queue,
the go_instantiate op was not called. Eventually the GL_SKIP holder may
call the instantiate sub-function (e.g. gfs2_rgrp_bh_get) but there was
still a window of time in which another non-GL_SKIP holder assumes the
instantiate function had been called by the first holder. In the case of
rgrp glocks, this led to a NULL pointer dereference on the buffer_heads.
This patch tries to fix the problem by introducing two new glock flags:
GLF_INSTANTIATE_NEEDED, which keeps track of when the instantiate function
needs to be called to "fill in" or "read in" the object before it is
referenced.
GLF_INSTANTIATE_IN_PROG which is used to determine when a process is
in the process of reading in the object. Whenever a function needs to
reference the object, it checks the GLF_INSTANTIATE_NEEDED flag, and if
set, it sets GLF_INSTANTIATE_IN_PROG and calls the glops "go_instantiate"
function.
As before, the gl_lockref spin_lock is unlocked during the IO operation,
which may take a relatively long amount of time to complete. While
unlocked, if another process determines go_instantiate is still needed,
it sees GLF_INSTANTIATE_IN_PROG is set, and waits for the go_instantiate
glop operation to be completed. Once GLF_INSTANTIATE_IN_PROG is cleared,
it needs to check GLF_INSTANTIATE_NEEDED again because the other process's
go_instantiate operation may not have been successful.
Functions that previously called the instantiate sub-functions now call
directly into gfs2_instantiate so the new bits are managed properly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-06 22:29:18 +08:00
|
|
|
if (!ip) /* no inode to populate - read it in later */
|
2022-06-10 17:42:33 +08:00
|
|
|
return 0;
|
2006-01-17 00:50:04 +08:00
|
|
|
|
2022-06-10 17:42:33 +08:00
|
|
|
return gfs2_inode_refresh(ip);
|
|
|
|
}
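The two-flag protocol described in the "gfs2: fix GL_SKIP node_scope problems" commit message above can be modelled in user space. This is a rough sketch under stated assumptions, not the kernel code: `struct sk_glock`, `sk_instantiate()` and `sk_flaky_read()` are hypothetical names, a pthread mutex stands in for the gl_lockref spin lock, and a condition variable stands in for whatever wait mechanism the kernel uses. The point it illustrates is that a waiter must re-check the "needed" flag after the "in progress" flag clears, because the other caller's attempt may have failed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a glock; none of these names exist in GFS2. */
struct sk_glock {
	pthread_mutex_t lock;   /* plays the role of the gl_lockref spin lock */
	pthread_cond_t  idle;   /* signalled whenever in_prog is cleared */
	bool needed;            /* models GLF_INSTANTIATE_NEEDED */
	bool in_prog;           /* models GLF_INSTANTIATE_IN_PROG */
	int  calls;             /* how often the read-in callback has run */
	int (*read_in)(struct sk_glock *gl);
};

void sk_glock_init(struct sk_glock *gl, int (*read_in)(struct sk_glock *))
{
	pthread_mutex_init(&gl->lock, NULL);
	pthread_cond_init(&gl->idle, NULL);
	gl->needed = true;
	gl->in_prog = false;
	gl->calls = 0;
	gl->read_in = read_in;
}

/* Make sure the object is read in; returns 0 or the callback's error. */
int sk_instantiate(struct sk_glock *gl)
{
	int ret = 0;

	pthread_mutex_lock(&gl->lock);
	while (gl->needed) {
		if (gl->in_prog) {
			/* Another caller is reading the object in. Wait, then
			 * loop to re-check needed: their attempt may have failed. */
			pthread_cond_wait(&gl->idle, &gl->lock);
			continue;
		}
		gl->in_prog = true;
		pthread_mutex_unlock(&gl->lock);  /* no lock held across slow I/O */
		ret = gl->read_in(gl);
		pthread_mutex_lock(&gl->lock);
		if (ret == 0)
			gl->needed = false;
		gl->in_prog = false;
		pthread_cond_broadcast(&gl->idle);
		break;
	}
	pthread_mutex_unlock(&gl->lock);
	return ret;
}

/* Demo callback: fails on the first call, succeeds afterwards. */
int sk_flaky_read(struct sk_glock *gl)
{
	return ++gl->calls == 1 ? -5 : 0;
}
```

With `sk_flaky_read` as the callback, the first `sk_instantiate()` call reports the error and leaves `needed` set, the second one succeeds and clears it, and later calls return immediately without invoking the callback again.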
|
|
|
|
|
|
|
|
static int inode_go_held(struct gfs2_holder *gh)
|
|
|
|
{
|
|
|
|
struct gfs2_glock *gl = gh->gh_gl;
|
|
|
|
struct gfs2_inode *ip = gl->gl_object;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
if (!ip) /* no inode to populate - read it in later */
|
|
|
|
return 0;
|
2006-01-17 00:50:04 +08:00
|
|
|
|
2013-12-19 19:04:14 +08:00
|
|
|
if (gh->gh_state != LM_ST_DEFERRED)
|
|
|
|
inode_dio_wait(&ip->i_inode);
|
|
|
|
|
2008-11-04 18:05:22 +08:00
|
|
|
if ((ip->i_diskflags & GFS2_DIF_TRUNC_IN_PROG) &&
|
2006-01-17 00:50:04 +08:00
|
|
|
(gl->gl_state == LM_ST_EXCLUSIVE) &&
|
2022-06-03 04:15:02 +08:00
|
|
|
(gh->gh_state == LM_ST_EXCLUSIVE))
|
|
|
|
error = gfs2_truncatei_resume(ip);
|
2006-01-17 00:50:04 +08:00
|
|
|
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2008-05-22 00:03:22 +08:00
|
|
|
/**
|
|
|
|
* inode_go_dump - print information about an inode
|
|
|
|
* @seq: The iterator
|
2021-03-31 00:44:29 +08:00
|
|
|
* @gl: The glock
|
2019-05-09 22:21:48 +08:00
|
|
|
* @fs_id_buf: file system id (may be empty)
|
2008-05-22 00:03:22 +08:00
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
2023-06-22 04:32:06 +08:00
|
|
|
static void inode_go_dump(struct seq_file *seq, const struct gfs2_glock *gl,
|
2019-05-09 22:21:48 +08:00
|
|
|
const char *fs_id_buf)
|
2008-05-22 00:03:22 +08:00
|
|
|
{
|
2018-04-19 03:05:01 +08:00
|
|
|
struct gfs2_inode *ip = gl->gl_object;
|
2023-06-22 04:32:06 +08:00
|
|
|
const struct inode *inode = &ip->i_inode;
|
2018-04-19 03:05:01 +08:00
|
|
|
|
2008-05-22 00:03:22 +08:00
|
|
|
if (ip == NULL)
|
2014-01-16 18:31:13 +08:00
|
|
|
return;
|
2018-04-19 03:05:01 +08:00
|
|
|
|
2019-05-09 22:21:48 +08:00
|
|
|
gfs2_print_dbg(seq, "%s I: n:%llu/%llu t:%u f:0x%02lx d:0x%08x s:%llu "
|
|
|
|
"p:%lu\n", fs_id_buf,
|
2008-05-22 00:03:22 +08:00
|
|
|
(unsigned long long)ip->i_no_formal_ino,
|
|
|
|
(unsigned long long)ip->i_no_addr,
|
2023-06-22 04:32:06 +08:00
|
|
|
IF2DT(inode->i_mode), ip->i_flags,
|
2008-11-10 18:10:12 +08:00
|
|
|
(unsigned int)ip->i_diskflags,
|
2023-06-22 04:32:06 +08:00
|
|
|
(unsigned long long)i_size_read(inode),
|
|
|
|
inode->i_data.nrpages);
|
2008-05-22 00:03:22 +08:00
|
|
|
}
|
|
|
|
|
2006-01-17 00:50:04 +08:00
|
|
|
/**
|
gfs2: Rework freeze / thaw logic
So far, at mount time, gfs2 would take the freeze glock in shared mode
and then immediately drop it again, turning it into a cached glock that
can be reclaimed at any time. To freeze the filesystem cluster-wide,
the node initiating the freeze would take the freeze glock in exclusive
mode, which would cause the freeze glock's freeze_go_sync() callback to
run on each node. There, gfs2 would freeze the filesystem and schedule
gfs2_freeze_func() to run. gfs2_freeze_func() would re-acquire the
freeze glock in shared mode, thaw the filesystem, and drop the freeze
glock again. The initiating node would keep the freeze glock held in
exclusive mode. To thaw the filesystem, the initiating node would drop
the freeze glock again, which would allow gfs2_freeze_func() to resume
on all nodes, leaving the filesystem in the thawed state.
It turns out that in freeze_go_sync(), we cannot reliably and safely
freeze the filesystem. This is primarily because the final unmount of a
filesystem takes a write lock on the s_umount rw semaphore before
calling into gfs2_put_super(), and freeze_go_sync() needs to call
freeze_super() which also takes a write lock on the same semaphore,
causing a deadlock. We could work around this by trying to take an
active reference on the super block first, which would prevent unmount
from running at the same time. But that can fail, and freeze_go_sync()
isn't actually allowed to fail.
To get around this, this patch changes the freeze glock locking scheme
as follows:
At mount time, each node takes the freeze glock in shared mode. To
freeze a filesystem, the initiating node first freezes the filesystem
locally and then drops and re-acquires the freeze glock in exclusive
mode. All other nodes notice that there is contention on the freeze
glock in their go_callback callbacks, and they schedule
gfs2_freeze_func() to run. There, they freeze the filesystem locally
and drop and re-acquire the freeze glock before re-thawing the
filesystem. This is happening outside of the glock state engine, so
there, we are allowed to fail.
From a cluster point of view, taking and immediately dropping a glock is
indistinguishable from taking the glock and only dropping it upon
contention, so this new scheme is compatible with the old one.
Thanks to Li Dong <lidong@vivo.com> for reporting a locking bug in
gfs2_freeze_func() in a previous version of this commit.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-11-15 06:34:50 +08:00
|
|
|
* freeze_go_callback - A cluster node is requesting a freeze
|
2006-01-17 00:50:04 +08:00
|
|
|
* @gl: the glock
|
2022-11-15 06:34:50 +08:00
|
|
|
* @remote: true if this came from a different cluster node
|
2006-01-17 00:50:04 +08:00
|
|
|
*/
|
|
|
|
|
2022-11-15 06:34:50 +08:00
|
|
|
static void freeze_go_callback(struct gfs2_glock *gl, bool remote)
|
2006-01-17 00:50:04 +08:00
|
|
|
{
|
2015-03-17 00:52:05 +08:00
|
|
|
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
|
2022-11-15 06:34:50 +08:00
|
|
|
struct super_block *sb = sdp->sd_vfs;
|
|
|
|
|
|
|
|
if (!remote ||
|
2023-09-12 02:00:28 +08:00
|
|
|
(gl->gl_state != LM_ST_SHARED &&
|
|
|
|
gl->gl_state != LM_ST_UNLOCKED) ||
|
2022-11-15 06:34:50 +08:00
|
|
|
gl->gl_demote_state != LM_ST_UNLOCKED)
|
|
|
|
return;
|
2006-01-17 00:50:04 +08:00
|
|
|
|
2020-11-18 21:54:31 +08:00
|
|
|
/*
|
2022-11-15 06:34:50 +08:00
|
|
|
* Try to get an active super block reference to prevent racing with
|
2023-09-12 02:00:28 +08:00
|
|
|
* unmount (see super_trylock_shared()). But note that unmount isn't
|
|
|
|
* the only place where a write lock on s_umount is taken, and we can
|
|
|
|
* fail here because of things like remount as well.
|
2020-11-18 21:54:31 +08:00
|
|
|
*/
|
2022-11-15 06:34:50 +08:00
|
|
|
if (down_read_trylock(&sb->s_umount)) {
|
|
|
|
atomic_inc(&sb->s_active);
|
|
|
|
up_read(&sb->s_umount);
|
|
|
|
if (!queue_work(gfs2_freeze_wq, &sdp->sd_freeze_work))
|
|
|
|
deactivate_super(sb);
|
2006-01-17 00:50:04 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
GFS2: remove transaction glock
GFS2 has a transaction glock, which must be grabbed for every
transaction, whose purpose is to deal with freezing the filesystem.
Aside from this involving a large amount of locking, it is very easy to
make the current fsfreeze code hang on unfreezing.
This patch rewrites how gfs2 handles freezing the filesystem. The
transaction glock is removed. In its place is a freeze glock, which is
cached (but not held) in a shared state by every node in the cluster
when the filesystem is mounted. This lock only needs to be grabbed on
freezing, and actions which need to be safe from freezing, like
recovery.
When a node wants to freeze the filesystem, it grabs this glock
exclusively. When the freeze glock state changes on the nodes (either
from shared to unlocked, or shared to exclusive), the filesystem does a
special log flush. gfs2_log_flush() does all the work for flushing out
and shutting down the incore log, and then it tries to grab the
freeze glock in a shared state again. Since the filesystem is stuck in
gfs2_log_flush, no new transaction can start, and nothing can be written
to disk. Unfreezing the filesystem simply involves dropping the freeze
glock, allowing gfs2_log_flush() to grab and then release the shared
lock, so it is cached for next time.
However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
shared lock on the filesystem root directory inode to check permissions.
If that glock has already been grabbed exclusively, fsfreeze will be
unable to get the shared lock and unfreeze the filesystem.
In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
on the filesystem root directory during the freeze, and hold it until it
unfreezes the filesystem. The functions which need to grab a shared
lock in order to allow the unfreeze ioctl to be issued now use the lock
grabbed by the freeze code instead.
The freeze and unfreeze code take care to make sure that this shared
lock will not be dropped while another process is using it.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-05-02 11:26:55 +08:00
|
|
|
* freeze_go_xmote_bh - After promoting/demoting the freeze glock
|
2006-01-17 00:50:04 +08:00
|
|
|
* @gl: the glock
|
|
|
|
*/
|
2021-03-19 19:56:44 +08:00
|
|
|
static int freeze_go_xmote_bh(struct gfs2_glock *gl)
|
2006-01-17 00:50:04 +08:00
|
|
|
{
|
2015-03-17 00:52:05 +08:00
|
|
|
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
|
2006-06-15 03:32:57 +08:00
|
|
|
struct gfs2_inode *ip = GFS2_I(sdp->sd_jdesc->jd_inode);
|
2006-02-28 06:23:27 +08:00
|
|
|
struct gfs2_glock *j_gl = ip->i_gl;
|
2006-10-14 09:47:13 +08:00
|
|
|
struct gfs2_log_header_host head;
|
2006-01-17 00:50:04 +08:00
|
|
|
int error;
|
|
|
|
|
2008-05-22 00:03:22 +08:00
|
|
|
if (test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags)) {
|
2006-11-20 23:37:45 +08:00
|
|
|
j_gl->gl_ops->go_inval(j_gl, DIO_METADATA);
|
2006-01-17 00:50:04 +08:00
|
|
|
|
2019-05-03 03:17:40 +08:00
|
|
|
error = gfs2_find_jhead(sdp->sd_jdesc, &head, false);
|
2021-06-01 22:41:40 +08:00
|
|
|
if (gfs2_assert_withdraw_delayed(sdp, !error))
|
|
|
|
return error;
|
|
|
|
if (gfs2_assert_withdraw_delayed(sdp, head.lh_flags &
|
|
|
|
GFS2_LOG_HEAD_UNMOUNT))
|
|
|
|
return -EIO;
|
|
|
|
sdp->sd_log_sequence = head.lh_sequence + 1;
|
|
|
|
gfs2_log_pointers_init(sdp, head.lh_blkno);
|
2006-01-17 00:50:04 +08:00
|
|
|
}
|
2008-05-22 00:03:22 +08:00
|
|
|
return 0;
|
2006-01-17 00:50:04 +08:00
|
|
|
}
|
|
|
|
|
2009-07-24 07:52:34 +08:00
|
|
|
/**
|
|
|
|
* iopen_go_callback - schedule the dcache entry for the inode to be deleted
|
|
|
|
* @gl: the glock
|
2021-03-31 00:44:29 +08:00
|
|
|
* @remote: true if this came from a different cluster node
|
2009-07-24 07:52:34 +08:00
|
|
|
*
|
2015-10-29 23:58:09 +08:00
|
|
|
* gl_lockref.lock is held while calling this
|
2009-07-24 07:52:34 +08:00
|
|
|
*/
|
2013-04-10 17:26:55 +08:00
|
|
|
static void iopen_go_callback(struct gfs2_glock *gl, bool remote)
|
2009-07-24 07:52:34 +08:00
|
|
|
{
|
2017-06-30 20:55:08 +08:00
|
|
|
struct gfs2_inode *ip = gl->gl_object;
|
2015-03-17 00:52:05 +08:00
|
|
|
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
|
2011-03-30 21:17:51 +08:00
|
|
|
|
2022-12-06 07:12:59 +08:00
|
|
|
if (!remote || sb_rdonly(sdp->sd_vfs) ||
|
2023-08-23 21:53:13 +08:00
|
|
|
test_bit(SDF_KILL, &sdp->sd_flags))
|
2011-03-30 21:17:51 +08:00
|
|
|
return;
|
2009-07-24 07:52:34 +08:00
|
|
|
|
|
|
|
if (gl->gl_demote_state == LM_ST_UNLOCKED &&
|
2009-12-08 20:12:13 +08:00
|
|
|
gl->gl_state == LM_ST_SHARED && ip) {
|
2013-10-15 22:18:08 +08:00
|
|
|
gl->gl_lockref.count++;
|
2022-12-21 07:52:51 +08:00
|
|
|
if (!gfs2_queue_try_to_evict(gl))
|
2013-10-15 22:18:08 +08:00
|
|
|
gl->gl_lockref.count--;
|
2009-07-24 07:52:34 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
gfs2: Force withdraw to replay journals and wait for it to finish
When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.
This patch makes file system withdraws seen by the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.
The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.
Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held.
The "live" glock is now used to signal when a withdraw
occurs. When a withdraw occurs, the node signals its withdraw by
dequeueing the "live" glock and trying to enqueue it in EX mode,
thus forcing the other nodes to all see a demote request, by way
of a "1CB" (one callback) try lock. The "live" glock is not
granted in EX; the callback is only used to indicate that a
withdraw has occurred.
Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.
Also note that the withdraw function may be called from a wide
variety of situations, and therefore, we need to take extra
precautions to make sure pointers are valid before using them in
many circumstances.
We also need to take care when glocks decide to withdraw, since
the withdraw code now uses glocks.
Also, before this patch, if a process encountered an error and
decided to withdraw, if another process was already withdrawing,
the second withdraw would be silently ignored, which set it free
to unlock its glocks. That's correct behavior if the original
withdrawer encounters further errors down the road. But if
secondary waiters don't wait for the journal replay, unlocking
glocks will allow other nodes to use them, despite the fact that
the journal containing those blocks is being replayed. The
replay needs to finish before our glocks are released to other
nodes. IOW, secondary withdraws need to wait for the first
withdraw to finish.
For example, if an rgrp glock is unlocked by a process that didn't
wait for the first withdraw, a journal replay could introduce file
system corruption by replaying a rgrp block that has already been
granted to a different cluster node.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-01-29 03:23:45 +08:00
|
|
|
/**
|
|
|
|
* inode_go_free - wake up anyone waiting for dlm's unlock ast to free it
|
|
|
|
* @gl: glock being freed
|
|
|
|
*
|
|
|
|
* For now, this is only used for the journal inode glock. In withdraw
|
|
|
|
* situations, we need to wait for the glock to be freed so that we know
|
|
|
|
* other nodes may proceed with recovery / journal replay.
|
|
|
|
*/
|
|
|
|
static void inode_go_free(struct gfs2_glock *gl)
|
|
|
|
{
|
|
|
|
/* Note that we cannot reference gl_object because it's already set
|
|
|
|
* to NULL by this point in its lifecycle. */
|
|
|
|
if (!test_bit(GLF_FREEING, &gl->gl_flags))
|
|
|
|
return;
|
|
|
|
clear_bit_unlock(GLF_FREEING, &gl->gl_flags);
|
|
|
|
wake_up_bit(&gl->gl_flags, GLF_FREEING);
|
|
|
|
}
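inode_go_free() clears GLF_FREEING with release semantics and then wakes anyone waiting on that bit. A rough userspace analogue of the "check, clear with release ordering, then notify" step can be sketched with C11 atomics. DEMO_FREEING and demo_finish_free() are hypothetical names, and the actual wake-up is left to the caller because there is no direct userspace equivalent of wake_up_bit().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_FREEING (1u << 0) /* hypothetical flag bit, models GLF_FREEING */

/*
 * Returns true if the FREEING bit was set and has now been cleared, in
 * which case the caller should wake its waiters. Returns false if there
 * was nothing to do.
 */
static bool demo_finish_free(atomic_uint *flags)
{
	if (!(atomic_load(flags) & DEMO_FREEING))
		return false;

	/* Release ordering: earlier writes are visible to whoever sees the bit clear. */
	atomic_fetch_and_explicit(flags, ~DEMO_FREEING, memory_order_release);
	return true;
}
```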
|
|
|
|
|
|
|
|
/**
|
|
|
|
* nondisk_go_callback - used to signal when a node did a withdraw
|
|
|
|
* @gl: the nondisk glock
|
|
|
|
* @remote: true if this came from a different cluster node
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
static void nondisk_go_callback(struct gfs2_glock *gl, bool remote)
|
|
|
|
{
|
|
|
|
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
|
|
|
|
|
|
|
|
/* Ignore the callback unless it's from another node, and it's the
|
|
|
|
live lock. */
|
|
|
|
if (!remote || gl->gl_name.ln_number != GFS2_LIVE_LOCK)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* First order of business is to cancel the demote request. We don't
|
|
|
|
* really want to demote a nondisk glock. At best it's just to inform
|
|
|
|
* us of another node's withdraw. We'll keep it in SH mode. */
|
|
|
|
clear_bit(GLF_DEMOTE, &gl->gl_flags);
|
|
|
|
clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags);
|
|
|
|
|
|
|
|
/* Ignore the unlock if we're withdrawn, unmounting, or in recovery. */
|
|
|
|
if (test_bit(SDF_NORECOVERY, &sdp->sd_flags) ||
|
|
|
|
test_bit(SDF_WITHDRAWN, &sdp->sd_flags) ||
|
|
|
|
test_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* We only care when a node wants us to unlock, because that means
|
|
|
|
* they want a journal recovered. */
|
|
|
|
if (gl->gl_demote_state != LM_ST_UNLOCKED)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (sdp->sd_args.ar_spectator) {
|
|
|
|
fs_warn(sdp, "Spectator node cannot recover journals.\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
fs_warn(sdp, "Some node has withdrawn; checking for recovery.\n");
|
|
|
|
set_bit(SDF_REMOTE_WITHDRAW, &sdp->sd_flags);
|
|
|
|
/*
|
|
|
|
* We can't call remote_withdraw directly here or gfs2_recover_journal
|
|
|
|
* because this is called from the glock unlock function and the
|
|
|
|
* remote_withdraw needs to enqueue and dequeue the same "live" glock
|
|
|
|
* we were called from. So we queue it to the control work queue in
|
|
|
|
* lock_dlm.
|
|
|
|
*/
|
|
|
|
queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
|
|
|
|
}
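The order of the early returns in nondisk_go_callback() can be summarized as a pure decision function. This is only an illustrative userspace sketch: struct demo_event and the DEMO_* values are hypothetical, and side effects of the real function (clearing the demote flags, setting SDF_REMOTE_WITHDRAW, logging, queueing the control work) are reduced to a return value.

```c
#include <stdbool.h>

enum demo_action {
	DEMO_IGNORE,          /* return without doing anything further */
	DEMO_WARN_SPECTATOR,  /* spectator nodes cannot recover journals */
	DEMO_QUEUE_RECOVERY,  /* mark remote withdraw and queue control work */
};

/* Hypothetical snapshot of the state the callback looks at. */
struct demo_event {
	bool remote;             /* callback came from another node */
	bool is_live_lock;       /* glock number is the "live" lock */
	bool norecovery;         /* models SDF_NORECOVERY */
	bool withdrawn;          /* models SDF_WITHDRAWN */
	bool remote_withdraw;    /* models SDF_REMOTE_WITHDRAW */
	bool demote_to_unlocked; /* demote state is LM_ST_UNLOCKED */
	bool spectator;          /* mounted as a spectator */
};

static enum demo_action demo_decide(const struct demo_event *ev)
{
	if (!ev->remote || !ev->is_live_lock)
		return DEMO_IGNORE;
	if (ev->norecovery || ev->withdrawn || ev->remote_withdraw)
		return DEMO_IGNORE;
	if (!ev->demote_to_unlocked)
		return DEMO_IGNORE;
	if (ev->spectator)
		return DEMO_WARN_SPECTATOR;
	return DEMO_QUEUE_RECOVERY;
}
```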
|
|
|
|
|
2006-08-30 21:30:00 +08:00
|
|
|
const struct gfs2_glock_operations gfs2_meta_glops = {
|
2006-09-05 22:53:09 +08:00
|
|
|
.go_type = LM_TYPE_META,
|
gfs2: Allow some glocks to be used during withdraw
We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
when we're withdrawn. For example, to maintain metadata integrity, we should
disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
iopen or the transaction glocks may be safely used because none of their
metadata goes through the journal. So in general, we should disallow all
glocks with an address space, and allow all the others. One exception is:
we need to allow our active journal to be demoted so others may recover it.
Allowing glocks after withdraw gives us the ability to take appropriate
action (in a following patch) to have our journal properly replayed by
another node rather than just abandoning the current transactions and
pretending nothing bad happened, leaving the other nodes free to modify
the blocks we had in our journal, which may result in file system
corruption.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2019-06-14 02:28:45 +08:00
|
|
|
.go_flags = GLOF_NONDISK,
|
2006-01-17 00:50:04 +08:00
|
|
|
};
|
|
|
|
|
2006-08-30 21:30:00 +08:00
|
|
|
const struct gfs2_glock_operations gfs2_inode_glops = {
|
2012-10-25 02:41:05 +08:00
|
|
|
.go_sync = inode_go_sync,
|
2006-01-17 00:50:04 +08:00
|
|
|
.go_inval = inode_go_inval,
|
|
|
|
.go_demote_ok = inode_go_demote_ok,
|
2021-09-30 04:06:21 +08:00
|
|
|
.go_instantiate = inode_go_instantiate,
|
2022-06-10 17:42:33 +08:00
|
|
|
.go_held = inode_go_held,
|
2008-05-22 00:03:22 +08:00
|
|
|
.go_dump = inode_go_dump,
|
2006-09-05 22:53:09 +08:00
|
|
|
.go_type = LM_TYPE_INODE,
|
2020-01-14 04:21:49 +08:00
|
|
|
.go_flags = GLOF_ASPACE | GLOF_LRU | GLOF_LVB,
|
2020-01-29 03:23:45 +08:00
|
|
|
.go_free = inode_go_free,
|
2006-01-17 00:50:04 +08:00
|
|
|
};
|
|
|
|
|
2006-08-30 21:30:00 +08:00
|
|
|
const struct gfs2_glock_operations gfs2_rgrp_glops = {
|
2012-10-25 02:41:05 +08:00
|
|
|
.go_sync = rgrp_go_sync,
|
2009-03-09 17:03:51 +08:00
|
|
|
.go_inval = rgrp_go_inval,
|
2021-09-30 04:06:21 +08:00
|
|
|
.go_instantiate = gfs2_rgrp_go_instantiate,
|
2020-10-07 19:30:58 +08:00
|
|
|
.go_dump = gfs2_rgrp_go_dump,
|
2006-09-05 22:53:09 +08:00
|
|
|
.go_type = LM_TYPE_RGRP,
|
2013-12-07 00:19:54 +08:00
|
|
|
.go_flags = GLOF_LVB,
|
2006-01-17 00:50:04 +08:00
|
|
|
};
|
|
|
|
|
GFS2: remove transaction glock
GFS2 has a transaction glock, which must be grabbed for every
transaction, whose purpose is to deal with freezing the filesystem.
Aside from this involving a large amount of locking, it is very easy to
make the current fsfreeze code hang on unfreezing.
This patch rewrites how gfs2 handles freezing the filesystem. The
transaction glock is removed. In its place is a freeze glock, which is
cached (but not held) in a shared state by every node in the cluster
when the filesystem is mounted. This lock only needs to be grabbed on
freezing, and actions which need to be safe from freezing, like
recovery.
When a node wants to freeze the filesystem, it grabs this glock
exclusively. When the freeze glock state changes on the nodes (either
from shared to unlocked, or shared to exclusive), the filesystem does a
special log flush. gfs2_log_flush() does all the work for flushing out
and shutting down the incore log, and then it tries to grab the
freeze glock in a shared state again. Since the filesystem is stuck in
gfs2_log_flush, no new transaction can start, and nothing can be written
to disk. Unfreezing the filesystem simply involves dropping the freeze
glock, allowing gfs2_log_flush() to grab and then release the shared
lock, so it is cached for next time.
However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
shared lock on the filesystem root directory inode to check permissions.
If that glock has already been grabbed exclusively, fsfreeze will be
unable to get the shared lock and unfreeze the filesystem.
In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
on the filesystem root directory during the freeze, and hold it until it
unfreezes the filesystem. The functions which need to grab a shared
lock in order to allow the unfreeze ioctl to be issued now use the lock
grabbed by the freeze code instead.
The freeze and unfreeze code take care to make sure that this shared
lock will not be dropped while another process is using it.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-05-02 11:26:55 +08:00
|
|
|
const struct gfs2_glock_operations gfs2_freeze_glops = {
|
|
|
|
.go_xmote_bh = freeze_go_xmote_bh,
|
gfs2: Rework freeze / thaw logic
So far, at mount time, gfs2 would take the freeze glock in shared mode
and then immediately drop it again, turning it into a cached glock that
can be reclaimed at any time. To freeze the filesystem cluster-wide,
the node initiating the freeze would take the freeze glock in exclusive
mode, which would cause the freeze glock's freeze_go_sync() callback to
run on each node. There, gfs2 would freeze the filesystem and schedule
gfs2_freeze_func() to run. gfs2_freeze_func() would re-acquire the
freeze glock in shared mode, thaw the filesystem, and drop the freeze
glock again. The initiating node would keep the freeze glock held in
exclusive mode. To thaw the filesystem, the initiating node would drop
the freeze glock again, which would allow gfs2_freeze_func() to resume
on all nodes, leaving the filesystem in the thawed state.
It turns out that in freeze_go_sync(), we cannot reliably and safely
freeze the filesystem. This is primarily because the final unmount of a
filesystem takes a write lock on the s_umount rw semaphore before
calling into gfs2_put_super(), and freeze_go_sync() needs to call
freeze_super() which also takes a write lock on the same semaphore,
causing a deadlock. We could work around this by trying to take an
active reference on the super block first, which would prevent unmount
from running at the same time. But that can fail, and freeze_go_sync()
isn't actually allowed to fail.
To get around this, this patch changes the freeze glock locking scheme
as follows:
At mount time, each node takes the freeze glock in shared mode. To
freeze a filesystem, the initiating node first freezes the filesystem
locally and then drops and re-acquires the freeze glock in exclusive
mode. All other nodes notice that there is contention on the freeze
glock in their go_callback callbacks, and they schedule
gfs2_freeze_func() to run. There, they freeze the filesystem locally
and drop and re-acquire the freeze glock before re-thawing the
filesystem. This is happening outside of the glock state engine, so
there, we are allowed to fail.
From a cluster point of view, taking and immediately dropping a glock is
indistinguishable from taking the glock and only dropping it upon
contention, so this new scheme is compatible with the old one.
Thanks to Li Dong <lidong@vivo.com> for reporting a locking bug in
gfs2_freeze_func() in a previous version of this commit.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-11-15 06:34:50 +08:00
|
|
|
.go_callback = freeze_go_callback,
|
2006-09-05 22:53:09 +08:00
|
|
|
.go_type = LM_TYPE_NONDISK,
|
2019-06-14 02:28:45 +08:00
|
|
|
.go_flags = GLOF_NONDISK,
|
2006-01-17 00:50:04 +08:00
|
|
|
};
|
|
|
|
|
2006-08-30 21:30:00 +08:00
|
|
|
const struct gfs2_glock_operations gfs2_iopen_glops = {
|
2006-09-05 22:53:09 +08:00
|
|
|
.go_type = LM_TYPE_IOPEN,
|
2009-07-24 07:52:34 +08:00
|
|
|
.go_callback = iopen_go_callback,
|
2021-12-14 23:40:12 +08:00
|
|
|
.go_dump = inode_go_dump,
|
2019-06-14 02:28:45 +08:00
|
|
|
.go_flags = GLOF_LRU | GLOF_NONDISK,
|
2020-11-23 23:53:35 +08:00
|
|
|
.go_subclass = 1,
|
2006-01-17 00:50:04 +08:00
|
|
|
};
|
|
|
|
|
2006-08-30 21:30:00 +08:00
|
|
|
const struct gfs2_glock_operations gfs2_flock_glops = {
|
2006-09-05 22:53:09 +08:00
|
|
|
.go_type = LM_TYPE_FLOCK,
|
2019-06-14 02:28:45 +08:00
|
|
|
.go_flags = GLOF_LRU | GLOF_NONDISK,
|
2006-01-17 00:50:04 +08:00
|
|
|
};
|
|
|
|
|
2006-08-30 21:30:00 +08:00
|
|
|
const struct gfs2_glock_operations gfs2_nondisk_glops = {
|
2006-09-05 22:53:09 +08:00
|
|
|
.go_type = LM_TYPE_NONDISK,
|
2019-06-14 02:28:45 +08:00
|
|
|
.go_flags = GLOF_NONDISK,
|
2020-01-29 03:23:45 +08:00
|
|
|
.go_callback = nondisk_go_callback,
|
2006-01-17 00:50:04 +08:00
|
|
|
};
|
|
|
|
|
2006-08-30 21:30:00 +08:00
|
|
|
const struct gfs2_glock_operations gfs2_quota_glops = {
|
2006-09-05 22:53:09 +08:00
|
|
|
.go_type = LM_TYPE_QUOTA,
|
2019-06-14 02:28:45 +08:00
|
|
|
.go_flags = GLOF_LVB | GLOF_LRU | GLOF_NONDISK,
|
2006-01-17 00:50:04 +08:00
|
|
|
};
|
|
|
|
|
2006-08-30 21:30:00 +08:00
|
|
|
const struct gfs2_glock_operations gfs2_journal_glops = {
|
2006-09-05 22:53:09 +08:00
|
|
|
.go_type = LM_TYPE_JOURNAL,
|
2019-06-14 02:28:45 +08:00
|
|
|
.go_flags = GLOF_NONDISK,
|
2006-01-17 00:50:04 +08:00
|
|
|
};
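In the operation tables above, every glock type except inode and rgrp carries GLOF_NONDISK, which matches the rule from the "Allow some glocks to be used during withdraw" commit: inode and rgrp glocks are off limits once withdrawn, with an exception for demoting the active journal. A hedged sketch of such a check follows. The DEMO_* flag values and demo_usable_when_withdrawn() are hypothetical, and this is not claimed to be how the kernel implements the test.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the real GLOF_* values may differ. */
#define DEMO_GLOF_ASPACE  (1u << 0)
#define DEMO_GLOF_LVB     (1u << 1)
#define DEMO_GLOF_LRU     (1u << 2)
#define DEMO_GLOF_NONDISK (1u << 3)

/*
 * Assumed rule: after a withdraw, only glocks flagged NONDISK may be used,
 * except that the node's own active journal glock may still be demoted so
 * that another node can recover the journal.
 */
static bool demo_usable_when_withdrawn(unsigned int go_flags,
				       bool is_active_journal_demote)
{
	if (is_active_journal_demote)
		return true;
	return (go_flags & DEMO_GLOF_NONDISK) != 0;
}
```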
|
|
|
|
|
GFS2: Add a "demote a glock" interface to sysfs
This adds a sysfs file called demote_rq to GFS2's
per filesystem directory. It's possible to use this
file to demote arbitrary glocks in exactly the same
way as if a request had come in from a remote node.
This is intended for testing issues relating to caching
of data under glocks. Despite that, the interface is
generic enough to send requests to any type of glock,
but be careful as it's not always safe to send an
arbitrary message to an arbitrary glock. For that reason
and to prevent DoS, this interface is restricted to root
only.
The messages look like this:
<type>:<glocknumber> <mode>
Example:
echo -n "2:13324 EX" >/sys/fs/gfs2/unity:myfs/demote_rq
Which means "please demote inode glock (type 2) number 13324 so that
I can get an EX (exclusive) lock". The lock modes are those which
would normally be sent by a remote node in its callback so if you
want to unlock a glock, you use EX, to demote to shared, use SH or PR
(depending on whether you like GFS2 or DLM lock modes better!).
If the glock doesn't exist, you'll get -ENOENT returned. If the
arguments don't make sense, you'll get -EINVAL returned.
The plan is that this interface will be used in combination with
the blktrace patch which I recently posted for comments although
it is, of course, still useful in its own right.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
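The "<type>:<glocknumber> <mode>" message format described above can be parsed along the following lines. This is a userspace sketch, not the kernel's parser: the demo_* names are hypothetical, only the three modes named in the commit message (EX, SH, PR) are accepted, and a trailing newline is treated as malformed because the example uses "echo -n".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of one demote request. */
struct demo_demote_rq {
	unsigned int type;          /* glock type, e.g. 2 for an inode glock */
	unsigned long long number;  /* glock number */
	char mode[3];               /* "EX", "SH" or "PR" */
};

/* Returns 0 on success, -1 if the message does not match the format. */
static int demo_parse_demote_rq(const char *msg, struct demo_demote_rq *rq)
{
	char mode[8];
	int consumed = 0;

	if (sscanf(msg, "%u:%llu %7s%n",
		   &rq->type, &rq->number, mode, &consumed) != 3)
		return -1;
	if (msg[consumed] != '\0')      /* reject trailing characters */
		return -1;
	if (strcmp(mode, "EX") != 0 && strcmp(mode, "SH") != 0 &&
	    strcmp(mode, "PR") != 0)
		return -1;

	strcpy(rq->mode, mode);         /* safe: mode is exactly two characters */
	return 0;
}
```

A caller would map a failure here to -EINVAL and a well-formed request for a nonexistent glock to -ENOENT, as the commit message describes.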
2009-02-12 21:31:58 +08:00
|
|
|
const struct gfs2_glock_operations *gfs2_glops_list[] = {
|
|
|
|
[LM_TYPE_META] = &gfs2_meta_glops,
|
|
|
|
[LM_TYPE_INODE] = &gfs2_inode_glops,
|
|
|
|
[LM_TYPE_RGRP] = &gfs2_rgrp_glops,
|
|
|
|
[LM_TYPE_IOPEN] = &gfs2_iopen_glops,
|
|
|
|
[LM_TYPE_FLOCK] = &gfs2_flock_glops,
|
|
|
|
[LM_TYPE_NONDISK] = &gfs2_nondisk_glops,
|
|
|
|
[LM_TYPE_QUOTA] = &gfs2_quota_glops,
|
|
|
|
[LM_TYPE_JOURNAL] = &gfs2_journal_glops,
|
|
|
|
};
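gfs2_glops_list is a table indexed by lock type using designated initializers, so any index that is not listed is left NULL. A lookup over such a table needs both a bounds check and a NULL check. The sketch below shows the idea with hypothetical demo_* types. The value 2 for the inode type comes from the demote_rq commit message; the other numeric values are assumptions for illustration.

```c
#include <stddef.h>

/* Assumed type numbers; only "inode is type 2" is stated in the text above. */
enum demo_type {
	DEMO_TYPE_META  = 1,
	DEMO_TYPE_INODE = 2,
	DEMO_TYPE_RGRP  = 3,
};

struct demo_ops {
	const char *name;
	unsigned int type;
};

static const struct demo_ops demo_meta_ops  = { "meta",  DEMO_TYPE_META };
static const struct demo_ops demo_inode_ops = { "inode", DEMO_TYPE_INODE };
static const struct demo_ops demo_rgrp_ops  = { "rgrp",  DEMO_TYPE_RGRP };

/* Index 0 is not listed, so it is implicitly NULL. */
static const struct demo_ops *const demo_ops_list[] = {
	[DEMO_TYPE_META]  = &demo_meta_ops,
	[DEMO_TYPE_INODE] = &demo_inode_ops,
	[DEMO_TYPE_RGRP]  = &demo_rgrp_ops,
};

/* Returns the ops for a type, or NULL for out-of-range or unlisted types. */
static const struct demo_ops *demo_lookup(unsigned int type)
{
	if (type >= sizeof(demo_ops_list) / sizeof(demo_ops_list[0]))
		return NULL;
	return demo_ops_list[type];
}
```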
|
|
|
|
|