
Linux Kernel Spinlock (Part 1)

Table of Contents

  • Preface
  • 1. Spinlocks
    • 1.1 Overview
    • 1.2 API
      • 1.2.1 spin_lock/spin_unlock
      • 1.2.2 spin_lock_irq/spin_unlock_irq
      • 1.2.3 spin_lock_irqsave/spin_unlock_irqrestore
      • 1.2.4 spin_lock_bh/spin_unlock_bh
      • 1.2.5 Additional Notes
  • 2. How Spinlocks Work
  • 3. Spinlock Usage in the Kernel
    • 3.1 struct file
    • 3.2 struct dentry
    • 3.3 struct inode
    • 3.4 struct super_block
  • References

Preface

In software engineering, a spinlock is a synchronization primitive used in multithreaded applications to protect shared resources. When a thread tries to acquire a spinlock that is already held by another thread, it does not go to sleep (as it would with a traditional mutex). Instead, it keeps "spinning" in a loop, repeatedly checking whether the lock has become available. This behavior is called busy waiting.

Because a spinlock avoids the overhead of rescheduling and context switching, it is very efficient when threads are only expected to block for a short time. For this reason, operating system kernels use spinlocks extensively.

From the kernel's point of view:
On an SMP system, several CPUs can be in kernel mode at the same time and could, in theory, manipulate any existing data structure. To keep the CPUs from interfering with one another, certain regions of the kernel must be protected by locks, which guarantee that only one CPU at a time can access a protected region.

The kernel has unrestricted access to the entire address space. On a multiprocessor system (or, equivalently, on a uniprocessor system with kernel preemption enabled) this causes problems: if several processors are in kernel mode at the same time, they can in theory access the same data structure simultaneously.

1. Spinlocks

1.1 Overview

When the kernel hits contention on a resource, it can choose between two locking strategies:
(1) wait in place (busy-wait), or
(2) suspend the current process and schedule another one (sleep).
A spinlock takes the first approach: it spins in place until the lock is released.

When a lock is needed in interrupt context, a spinlock is the first choice.

Spinlocks are the most commonly used locking option. They protect short sections of code against access from other processors. While waiting for a spinlock to be released, the kernel repeatedly checks whether it can take the lock, without going to sleep (busy waiting). This is obviously inefficient if the wait is long.

Spinlocks are used to protect short code sections that contain only a few C statements and therefore finish quickly. Most kernel data structures have their own spinlock, which must be acquired when critical members of the structure are manipulated. Spinlocks are ubiquitous in the kernel source.
When the CPU running task B (CPU B) wants to acquire a lock through the spinlock's locking function while that lock is already held by another CPU (say CPU A, running task A), CPU B simply spins in a loop, blocking task B, until the other CPU releases the lock (task A calls the spinlock's unlock function). This spinning only happens on multi-core machines, which is why the use case described earlier involves multiple CPUs; on a single-core machine it cannot happen: a task either holds the spinlock and keeps running, or does not run until the lock is released. A spinlock is a lock held by a CPU, in contrast to a mutex, which is a lock held by a task. A spinlock works by disabling the scheduler on the local CPU (the CPU running the task that called the spinlock's locking API). This also means that the task currently running on that CPU cannot be preempted by another task, although it can still be interrupted by IRQs if they are not disabled (more on that later). In other words, a spinlock protects a resource that only one CPU may acquire/access at a time, which makes spinlocks suitable for SMP safety and for performing tasks atomically.

Spinlocks are not the only construct that relies on the hardware's atomic capabilities. In the Linux kernel, for example, the preemption state is tracked in a per-CPU counter: a value of 0 means preemption is enabled, while a value greater than 0 means preemption is disabled (schedule() becomes a no-op). Disabling preemption (preempt_disable()) therefore consists of adding 1 to the current per-CPU counter (actually preempt_count), while preempt_enable() subtracts 1 from it, checks whether the new value is 0, and calls schedule() if so. These additions and subtractions must themselves be atomic and thus rely on the CPU providing atomic add/subtract operations.
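As a rough sketch of that bookkeeping (hypothetical function names, plain single-CPU C for illustration only; the real kernel code lives in include/linux/preempt.h and is per-CPU, atomic, and architecture-specific):

/* Conceptual sketch of the preempt_count bookkeeping described above. */
static int preempt_count;                 /* 0 means "preemption enabled" */

static void my_preempt_disable(void)      /* hypothetical stand-in for preempt_disable() */
{
	preempt_count++;                  /* an atomic increment on real hardware */
}

static void my_preempt_enable(void)       /* hypothetical stand-in for preempt_enable() */
{
	if (--preempt_count == 0) {
		/* Preemption is legal again; at this point the kernel checks
		 * whether a reschedule is pending and calls schedule(). */
	}
}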

1.2 API


1.2.1 spin_lock/spin_unlock

The most commonly used pair of APIs is spin_lock()/spin_unlock(). Usage:
Dynamic definition:

spinlock_t lock;        // define a spinlock variable
spin_lock_init(&lock);  // initialize the spinlock

Static definition:

DEFINE_SPINLOCK(lock);

Usage:

spin_lock(&lock);    // acquire the lock
// critical section
spin_unlock(&lock);  // release the lock
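Putting it together, here is a minimal sketch of protecting a shared counter (the structure and function names are hypothetical, not taken from the kernel source; requires <linux/spinlock.h>):

struct my_counter {                  /* hypothetical example structure */
	spinlock_t lock;
	unsigned long value;
};

static void my_counter_init(struct my_counter *c)
{
	spin_lock_init(&c->lock);
	c->value = 0;
}

static void my_counter_inc(struct my_counter *c)
{
	spin_lock(&c->lock);         /* enter the critical section */
	c->value++;                  /* short, non-sleeping work only */
	spin_unlock(&c->lock);       /* leave the critical section */
}

The kernel implementation behind these APIs looks like this: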
// include/linux/spinlock_types.h

#define DEFINE_SPINLOCK(x)	spinlock_t x = __SPIN_LOCK_UNLOCKED(x)
// include/linux/spinlock.h

# define spin_lock_init(_lock)			\
do {						\
	spinlock_check(_lock);			\
	*(_lock) = __SPIN_LOCK_UNLOCKED(_lock);	\
} while (0)
// include/linux/spinlock.h

#define raw_spin_lock(lock)	_raw_spin_lock(lock)

static __always_inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
}
// kernel/locking/spinlock.c

void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
	__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
//  /include/linux/spinlock_api_smp.h

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

preempt_disable(): disables preemption; this increments the preemption count and prevents the task on the current CPU from being preempted.
spin_acquire(): marks the acquisition of the spinlock (a lockdep annotation).
LOCK_CONTENDED(): handles the case where the spinlock is contended. Depending on the situation, it calls do_raw_spin_trylock() to attempt the lock or do_raw_spin_lock() to take it outright.

static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
	__acquire(lock);
	arch_spin_lock(&lock->raw_lock);
	mmiowb_spin_lock();
}
// /include/asm-generic/qspinlock.h

#define arch_spin_lock(l)		queued_spin_lock(l)

The resulting call chain is: spin_lock() → raw_spin_lock() → _raw_spin_lock() → __raw_spin_lock() → do_raw_spin_lock() → arch_spin_lock() → queued_spin_lock().
Using a spinlock this way has a limitation. Although the spinlock prevents preemption on the local CPU, it cannot prevent that CPU from being taken over by an interrupt (i.e. from running an interrupt handler). Imagine that the CPU holds a spinlock protecting some resource and an interrupt fires. The CPU stops the current task and jumps to the interrupt handler. So far, so good. Now imagine that the handler needs to acquire that very same spinlock (you may have guessed that the resource is shared with the interrupt handler). It will spin in place forever, trying to acquire a lock held by the very task it has just preempted. This situation is a deadlock.

To solve this problem, the Linux kernel provides _irq variants of the spinlock functions which, in addition to disabling/enabling preemption, also disable/enable interrupts on the local CPU: spin_lock_irq() and spin_unlock_irq(). By using these functions, the interrupt handler's need for the spinlock is taken into account and the deadlock is avoided.

1.2.2 spin_lock_irq/spin_unlock_irq

#define raw_spin_lock_irq(lock)		_raw_spin_lock_irq(lock)

static __always_inline void spin_lock_irq(spinlock_t *lock)
{
	raw_spin_lock_irq(&lock->rlock);
}
void __lockfunc _raw_spin_lock_irq(raw_spinlock_t *lock)
{
	__raw_spin_lock_irq(lock);
}
EXPORT_SYMBOL(_raw_spin_lock_irq);
static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
{
	local_irq_disable();
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

local_irq_disable(): disables interrupts on the local CPU, which prevents interrupt handlers from running on this CPU and ensures the holder is not interrupted while the spinlock is held.
preempt_disable(): disables preemption; this increments the preemption count and prevents the task on the current CPU from being preempted.
spin_acquire(): marks the acquisition of the spinlock (a lockdep annotation).
LOCK_CONTENDED(): handles the case where the spinlock is contended. Depending on the situation, it calls do_raw_spin_trylock() to attempt the lock or do_raw_spin_lock() to take it outright.
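As a hedged usage sketch (hypothetical device structure and handler names; requires <linux/spinlock.h> and <linux/interrupt.h>), the typical pattern is to take the lock with spin_lock_irq() in process context and with a plain spin_lock() inside the hard-IRQ handler:

struct my_dev {                          /* hypothetical example structure */
	spinlock_t lock;
	u32 status;
};

static irqreturn_t my_dev_irq_handler(int irq, void *data)
{
	struct my_dev *dev = data;

	spin_lock(&dev->lock);           /* already running in hard-IRQ context */
	dev->status |= 0x1;
	spin_unlock(&dev->lock);
	return IRQ_HANDLED;
}

static void my_dev_update(struct my_dev *dev, u32 val)
{
	spin_lock_irq(&dev->lock);       /* process context: also mask local IRQs */
	dev->status = val;
	spin_unlock_irq(&dev->lock);     /* unconditionally re-enables local IRQs */
}

Note that spin_unlock_irq() unconditionally re-enables local interrupts, which is exactly the limitation discussed next.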

The _irq variants only partially solve the problem. Imagine that interrupts were already disabled on the processor before your code took the lock. When you call spin_unlock_irq(), you do not just release the lock, you also enable interrupts — and possibly in the wrong way, because spin_unlock_irq() has no way of knowing which interrupts were enabled and which were disabled before the lock was taken.

Here is a short example:
(1) Suppose interrupts x and y were disabled before the spinlock was acquired, while z was not.
(2) spin_lock_irq() disables interrupts (now x, y, and z are all disabled) and takes the lock.
(3) spin_unlock_irq() enables interrupts. Now x, y, and z are all enabled, which was not the case before the lock was taken. That is the problem.

This makes spin_lock_irq() unsafe to use from IRQ context (or from any context where interrupts may already be disabled), because its counterpart spin_unlock_irq() simply enables IRQs, at the risk of enabling IRQs that were not enabled when spin_lock_irq() was called. It only makes sense to use spin_lock_irq() when you know that interrupts are enabled — that is, when you are sure nothing else has disabled interrupts on the local CPU.

In that case, imagine saving the interrupt state in a variable before taking the lock, and restoring the interrupts to that saved state when releasing it — then the problem disappears. To achieve this, the kernel provides the _irqsave variants, which behave like the _irq functions but additionally save and restore the interrupt state: spin_lock_irqsave() and spin_unlock_irqrestore().

1.2.3 spin_lock_irqsave/spin_unlock_irqrestore

#define spin_lock_irqsave(lock, flags)				\
do {								\
	raw_spin_lock_irqsave(spinlock_check(lock), flags);	\
} while (0)
#define raw_spin_lock_irqsave(lock, flags)			\
	do {						\
		typecheck(unsigned long, flags);	\
		flags = _raw_spin_lock_irqsave(lock);	\
	} while (0)
unsigned long __lockfunc _raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
	return __raw_spin_lock_irqsave(lock);
}
EXPORT_SYMBOL(_raw_spin_lock_irqsave);
static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
	unsigned long flags;

	local_irq_save(flags);
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	/*
	 * On lockdep we dont want the hand-coded irq-enable of
	 * do_raw_spin_lock_flags() code, because lockdep assumes
	 * that interrupts are not re-enabled during lock-acquire:
	 */
#ifdef CONFIG_LOCKDEP
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
#else
	do_raw_spin_lock_flags(lock, &flags);
#endif
	return flags;
}

local_irq_save(): saves the current interrupt state of the local CPU into the flags variable and then disables interrupts.
preempt_disable(): disables preemption; this increments the preemption count and prevents the task on the current CPU from being preempted.
spin_acquire(): marks the acquisition of the spinlock (a lockdep annotation).
LOCK_CONTENDED(): handles the case where the spinlock is contended. Depending on the situation, it calls do_raw_spin_trylock() to attempt the lock or do_raw_spin_lock() to take it outright.

Note:
Anywhere spin_lock_irq()/spin_unlock_irq() is used, spin_lock_irqsave()/spin_unlock_irqrestore() can be used instead; which pair to pick depends on the intended behavior. If you want all interrupts to be enabled once the critical section is done, use spin_lock_irq()/spin_unlock_irq(). If you only want to restore the interrupt state that was in effect before the critical section, use spin_lock_irqsave()/spin_unlock_irqrestore() — for example, if interrupt A was required to be disabled before the critical section, it should still be disabled afterwards.
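A minimal sketch of the save/restore pattern (hypothetical names; requires <linux/spinlock.h>), safe regardless of the interrupt state of the caller:

static DEFINE_SPINLOCK(stats_lock);          /* hypothetical lock */
static unsigned long stats_counter;          /* hypothetical shared data */

static void stats_inc(void)                  /* callable from any context */
{
	unsigned long flags;

	spin_lock_irqsave(&stats_lock, flags);      /* save and disable local IRQs */
	stats_counter++;
	spin_unlock_irqrestore(&stats_lock, flags); /* restore the saved IRQ state */
}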

1.2.4 spin_lock_bh/spin_unlock_bh

#define raw_spin_lock_bh(lock)		_raw_spin_lock_bh(lock)

static __always_inline void spin_lock_bh(spinlock_t *lock)
{
	raw_spin_lock_bh(&lock->rlock);
}
void __lockfunc _raw_spin_lock_bh(raw_spinlock_t *lock)
{
	__raw_spin_lock_bh(lock);
}
EXPORT_SYMBOL(_raw_spin_lock_bh);
static inline void __raw_spin_lock_bh(raw_spinlock_t *lock)
{
	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

__local_bh_disable_ip(): disables softirqs (bottom halves) on the local CPU.
spin_acquire(): marks the acquisition of the spinlock (a lockdep annotation).
LOCK_CONTENDED(): handles the case where the spinlock is contended. Depending on the situation, it calls do_raw_spin_trylock() to attempt the lock or do_raw_spin_lock() to take it outright.

The spin_lock_bh()/spin_unlock_bh() pair disables the bottom half of interrupt processing, i.e. softirqs, on the local CPU.
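A hedged sketch of the typical use case (hypothetical names; requires <linux/spinlock.h>, <linux/interrupt.h>, and <linux/list.h>): data shared between process context and a tasklet, which runs in softirq context:

static DEFINE_SPINLOCK(rx_lock);             /* hypothetical lock */
static LIST_HEAD(rx_queue);                  /* hypothetical shared queue */

/* Runs in softirq context: a plain spin_lock() is sufficient here. */
static void rx_tasklet_fn(struct tasklet_struct *t)
{
	spin_lock(&rx_lock);
	/* ... consume entries from rx_queue ... */
	spin_unlock(&rx_lock);
}

/* Runs in process context: must keep the tasklet from running on this CPU. */
static void rx_enqueue(struct list_head *item)
{
	spin_lock_bh(&rx_lock);              /* disables softirqs locally */
	list_add_tail(item, &rx_queue);
	spin_unlock_bh(&rx_lock);            /* re-enables softirqs */
}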

1.2.5 Additional Notes

spin_lock() and all of its variants automatically call preempt_disable(), which disables preemption on the local CPU. spin_unlock() and its variants call preempt_enable(), which tries to enable preemption (yes, tries! — it depends on whether other spinlocks are still held, which affects the value of the preemption counter) and, when preemption does become enabled (the counter reaches 0), internally calls schedule(). spin_unlock() is therefore a preemption point that may re-enable preemption.

Disabling interrupts may protect against kernel preemption (the scheduler's timer interrupt is disabled), but it does not stop the protected section from calling the scheduler (the schedule() function) itself. Many kernel functions invoke the scheduler indirectly, including the spinlock helpers themselves; even a simple printk() may involve the scheduler, because it takes the spinlock protecting the kernel message buffer. The kernel disables or enables the scheduler (and thus preemption) by incrementing or decrementing a per-CPU counter called preempt_count (0 by default, meaning "enabled"). When this counter is greater than 0 (which schedule() checks), the scheduler simply returns and does nothing. Every call to a spin_lock* helper increments the counter by 1; every spin_unlock* decrements it by 1, and whenever it reaches 0 the scheduler may be invoked — which means your critical section may not be atomic.

Therefore, disabling interrupts protects code against preemption only if the code itself does not trigger preemption. In particular, code that holds a spinlock must not sleep, because there would be no way to wake it up (remember that the timer interrupt and the scheduler are disabled on the local CPU).
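For instance, here is a buggy sketch of what not to do (hypothetical names; requires <linux/spinlock.h>, <linux/slab.h>, and <linux/delay.h>). Both calls below can sleep and are therefore illegal under a spinlock — the allocation would need GFP_ATOMIC, and the sleep would have to move outside the lock:

static DEFINE_SPINLOCK(demo_lock);           /* hypothetical lock */

static void buggy_critical_section(void)
{
	void *buf;

	spin_lock(&demo_lock);
	buf = kmalloc(128, GFP_KERNEL);      /* BUG: GFP_KERNEL may sleep */
	msleep(10);                          /* BUG: sleeping while holding the lock */
	spin_unlock(&demo_lock);
	kfree(buf);
}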

2. How Spinlocks Work

A spinlock is a hardware-based locking primitive. It relies on the ability of the underlying hardware to provide atomic operations (such as test_and_set, which in a non-atomic implementation would require separate read, modify, and write steps).

In general, a spinlock can only be implemented using special assembly instructions, such as an atomic (i.e. uninterruptible) test-and-set operation, and cannot easily be implemented in programming languages that do not support true atomic operations.
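To illustrate the idea (a hedged, user-space sketch using the GCC/Clang __atomic builtins — not the kernel's implementation), a minimal test-and-set spinlock looks like this:

typedef struct {
	volatile int locked;                 /* 0 = unlocked, 1 = locked */
} tas_spinlock_t;

static void tas_spin_lock(tas_spinlock_t *lock)
{
	/* Atomically swap in 1; if the previous value was already 1, keep spinning. */
	while (__atomic_exchange_n(&lock->locked, 1, __ATOMIC_ACQUIRE))
		;                            /* busy-wait */
}

static void tas_spin_unlock(tas_spinlock_t *lock)
{
	__atomic_store_n(&lock->locked, 0, __ATOMIC_RELEASE);
}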

Linux has had three generations of spinlock implementations:
(1) The spinlock in old kernels (before 2.6.25)

typedef struct {
	volatile unsigned int slock;
} raw_spinlock_t;

The raw_spinlock_t structure needs nothing more than a single unsigned int slock: slock == 0 means locked, 1 means unlocked.

When the lock is free, a thread can acquire it and flip the variable to the "locked" value; other threads see that value and have to wait. When the holder unlocks, it flips the variable back to the "unlocked" value, and another thread can then take the lock. Which value stands for "locked" and which for "unlocked" is merely a convention — it is just a state flag.

// v2.6.20/source/include/linux/spinlock.h

#define spin_lock(lock)			_spin_lock(lock)
// v2.6.20/source/kernel/spinlock.c

void __lockfunc _spin_lock(spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
}
// v2.6.20/source/include/linux/spinlock.h

# define _raw_spin_lock(lock)		__raw_spin_lock(&(lock)->raw_lock)
// v2.6.20/source/include/asm-x86_64/spinlock.h

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	asm volatile(
		"\n1:\t"
		LOCK_PREFIX " ; decl %0\n\t"
		"jns 2f\n"
		"3:\n"
		"rep;nop\n\t"
		"cmpl $0,%0\n\t"
		"jle 3b\n\t"
		"jmp 1b\n"
		"2:\t" : "=m" (lock->slock) : : "memory");
}

(1) LOCK_PREFIX is a macro that emits the x86 lock prefix; it expands to the appropriate locking instruction for the architecture.
(2) decl %0 decrements the variable referred to by the lock operand. If the result is not negative (tested with jns), the spinlock has been acquired.
(3) If the result of the decrement is negative, the code busy-waits (spins) at label 3:, executing the rep; nop pause hint.
(4) Inside the wait loop, the code keeps checking the spinlock's value; as long as it is less than or equal to 0 it keeps waiting.
(5) Once the spinlock's value becomes greater than 0, jmp 1b jumps back to step 1 to try to acquire the lock again.

// v2.6.20/source/include/asm-x86_64/spinlock.h

static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
	asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
}

(1) movl $1, %0 stores the value 1 into the variable referred to by lock->slock, marking the spinlock as released.
(2) "=m" (lock->slock) makes lock->slock a memory output operand, so the lock state is updated in memory.
(3) The "memory" clobber tells the compiler not to reorder or optimize memory accesses across this statement, so the store that releases the lock is neither moved nor eliminated.
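At the C level this release amounts to a store with release semantics; a hedged equivalent using the same builtins as the sketch above:

static void old_style_spin_unlock(raw_spinlock_t *lock)     /* illustrative only */
{
	__atomic_store_n(&lock->slock, 1, __ATOMIC_RELEASE);    /* 1 == unlocked */
}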

This implementation certainly works, and it performs well when there is no contention, but it has one problem: it is unfair. All threads compete for the spinlock in no particular order — whoever grabs it first wins, no matter whether a thread has been spinning for a long time or has only just started. With little contention the unfairness is hardly noticeable, but as hardware evolved and the number of cores grew, contention between cores intensified and the unordered competition of this spinlock became a performance problem.

(2) The spinlock in kernels 2.6.25 and later
From 2.6.25 onward, the spinlock is the ticket spinlock, a queued spinlock based on a FIFO algorithm.
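The FIFO idea can be sketched as follows (a hedged user-space illustration with the __atomic builtins, not the kernel's implementation): each locker atomically takes a ticket and waits until the "now serving" counter reaches it.

typedef struct {
	volatile unsigned int next;          /* next ticket to hand out */
	volatile unsigned int owner;         /* ticket currently being served */
} ticket_spinlock_t;

static void ticket_spin_lock(ticket_spinlock_t *lock)
{
	/* Atomically take a ticket (fetch-and-add), then wait for our turn. */
	unsigned int ticket = __atomic_fetch_add(&lock->next, 1, __ATOMIC_RELAXED);

	while (__atomic_load_n(&lock->owner, __ATOMIC_ACQUIRE) != ticket)
		;                            /* spin: strictly FIFO order */
}

static void ticket_spin_unlock(ticket_spinlock_t *lock)
{
	/* Hand the lock to the holder of the next ticket. */
	__atomic_store_n(&lock->owner, lock->owner + 1, __ATOMIC_RELEASE);
}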

(3) The spinlock in kernels 4.2 and later
From 4.2 onward, the spinlock is the queued spinlock, a queued lock based on the MCS algorithm.
Current kernels all use the queued spinlock.
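The MCS idea can be sketched like this (again a hedged, simplified user-space illustration; the kernel's qspinlock packs the queue state into a single 32-bit word and is considerably more involved): each waiter spins on its own queue node instead of the shared lock word, which avoids cache-line bouncing between CPUs.

struct mcs_node {
	struct mcs_node *next;
	volatile int locked;                 /* 1 = keep waiting, 0 = go */
};

typedef struct {
	struct mcs_node *tail;               /* last waiter in the queue */
} mcs_lock_t;

static void mcs_lock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	node->next = NULL;
	node->locked = 1;
	/* Atomically append our node to the queue. */
	prev = __atomic_exchange_n(&lock->tail, node, __ATOMIC_ACQ_REL);
	if (prev) {
		__atomic_store_n(&prev->next, node, __ATOMIC_RELEASE);
		while (__atomic_load_n(&node->locked, __ATOMIC_ACQUIRE))
			;                    /* spin only on our own node */
	}
}

static void mcs_unlock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *next = __atomic_load_n(&node->next, __ATOMIC_ACQUIRE);

	if (!next) {
		/* No visible successor: try to mark the queue empty. */
		struct mcs_node *expected = node;
		if (__atomic_compare_exchange_n(&lock->tail, &expected, NULL, 0,
						__ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
			return;
		/* A successor is enqueueing itself; wait for it to link in. */
		while (!(next = __atomic_load_n(&node->next, __ATOMIC_ACQUIRE)))
			;
	}
	__atomic_store_n(&next->locked, 0, __ATOMIC_RELEASE);    /* pass the lock on */
}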

3. Spinlock Usage in the Kernel

The kernel source below is from Linux 5.15.0, i.e. the queued-spinlock implementation.

3.1 struct file

struct file {
	/*
	 * Protects f_ep, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	......
	unsigned int 		f_flags;
	fmode_t			f_mode;
	......
}

Initialization of spinlock_t f_lock:

/**
 * alloc_file - allocate and initialize a 'struct file'
 *
 * @path: the (dentry, vfsmount) pair for the new file
 * @flags: O_... flags with which the new file will be opened
 * @fop: the 'struct file_operations' for the new file
 */
static struct file *alloc_file(const struct path *path, int flags,
		const struct file_operations *fop)
{
	struct file *file;

	file = alloc_empty_file(flags, current_cred());

	file->f_path = *path;
	file->f_inode = path->dentry->d_inode;
	file->f_mapping = path->dentry->d_inode->i_mapping;
	......
	file->f_mode |= FMODE_OPENED;
	file->f_op = fop;
	......
	return file;
}
alloc_file()
	-->alloc_empty_file()
		-->__alloc_file()
static struct file *__alloc_file(int flags, const struct cred *cred)
{
	struct file *f;

	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
	
	spin_lock_init(&f->f_lock);
	mutex_init(&f->f_pos_lock);
	f->f_flags = flags;
	f->f_mode = OPEN_FMODE(flags);
	
	return f;
}

Use of spinlock_t f_lock:

static int setfl(int fd, struct file * filp, unsigned long arg)
{
	......
	spin_lock(&filp->f_lock);
	filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
	spin_unlock(&filp->f_lock);
	......
}

Here the lock protects the f_flags member of struct file.

int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
{
	switch (advice) {
	case POSIX_FADV_NORMAL:
		file->f_ra.ra_pages = bdi->ra_pages;
		spin_lock(&file->f_lock);
		file->f_mode &= ~FMODE_RANDOM;
		spin_unlock(&file->f_lock);
		break;
	case POSIX_FADV_RANDOM:
		spin_lock(&file->f_lock);
		file->f_mode |= FMODE_RANDOM;
		spin_unlock(&file->f_lock);
		break;
	case POSIX_FADV_SEQUENTIAL:
		file->f_ra.ra_pages = bdi->ra_pages * 2;
		spin_lock(&file->f_lock);
		file->f_mode &= ~FMODE_RANDOM;
		spin_unlock(&file->f_lock);
		break;
	......
	}
}

Here the lock protects the f_mode member of struct file.

3.2 struct dentry

#define USE_CMPXCHG_LOCKREF \
	(IS_ENABLED(CONFIG_ARCH_USE_CMPXCHG_LOCKREF) && \
	 IS_ENABLED(CONFIG_SMP) && SPINLOCK_SIZE <= 4)

struct lockref {
	union {
#if USE_CMPXCHG_LOCKREF
		aligned_u64 lock_count;
#endif
		struct {
			spinlock_t lock;
			int count;
		};
	};
};
#define d_lock	d_lockref.lock

struct dentry {
	/* Ref lookup also touches following */
	struct lockref d_lockref;	/* per-dentry lock and refcount */
}

/*
 * dentry->d_lock spinlock nesting subclasses:
 *
 * 0: normal
 * 1: nested
 */
enum dentry_d_lock_class
{
	DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
	DENTRY_D_LOCK_NESTED
};

Usage example 1:

/**
 * d_alloc	-	allocate a dcache entry
 * @parent: parent of entry to allocate
 * @name: qstr of the name
 *
 * Allocates a dentry. It returns %NULL if there is insufficient memory
 * available. On a success the dentry is returned. The name passed in is
 * copied and the copy passed in may be reused after this call.
 */
struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
{
	struct dentry *dentry = __d_alloc(parent->d_sb, name);
	if (!dentry)
		return NULL;
	spin_lock(&parent->d_lock);
	/*
	 * don't need child lock because it is not subject
	 * to concurrency here
	 */
	__dget_dlock(parent);
	dentry->d_parent = parent;
	list_add(&dentry->d_child, &parent->d_subdirs);
	spin_unlock(&parent->d_lock);

	return dentry;
}
EXPORT_SYMBOL(d_alloc);

The spinlock_t is initialized when the dentry is allocated:

/**
 * __d_alloc	-	allocate a dcache entry
 * @sb: filesystem it will belong to
 * @name: qstr of the name
 *
 * Allocates a dentry. It returns %NULL if there is insufficient memory
 * available. On a success the dentry is returned. The name passed in is
 * copied and the copy passed in may be reused after this call.
 */
 
static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
{
	struct dentry *dentry;
	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
	......
	dentry->d_lockref.count = 1;
	spin_lock_init(&dentry->d_lock);
	seqcount_spinlock_init(&dentry->d_seq, &dentry->d_lock);
	......
}

Usage example 2:

/**
 * __d_lookup - search for a dentry (racy)
 * @parent: parent dentry
 * @name: qstr of name we wish to find
 * Returns: dentry, or NULL
 *
 * __d_lookup is like d_lookup, however it may (rarely) return a
 * false-negative result due to unrelated rename activity.
 *
 * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
 * however it must be used carefully, eg. with a following d_lookup in
 * the case of failure.
 *
 * __d_lookup callers must be commented.
 */
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
	unsigned int hash = name->hash;
	struct hlist_bl_head *b = d_hash(hash);
	struct hlist_bl_node *node;
	struct dentry *found = NULL;
	struct dentry *dentry;

	/*
	 * Note: There is significant duplication with __d_lookup_rcu which is
	 * required to prevent single threaded performance regressions
	 * especially on architectures where smp_rmb (in seqcounts) are costly.
	 * Keep the two functions in sync.
	 */

	/*
	 * The hash list is protected using RCU.
	 *
	 * Take d_lock when comparing a candidate dentry, to avoid races
	 * with d_move().
	 *
	 * It is possible that concurrent renames can mess up our list
	 * walk here and result in missing our dentry, resulting in the
	 * false-negative result. d_lookup() protects against concurrent
	 * renames using rename_lock seqlock.
	 *
	 * See Documentation/filesystems/path-lookup.txt for more details.
	 */
	rcu_read_lock();
	
	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {

		if (dentry->d_name.hash != hash)
			continue;

		spin_lock(&dentry->d_lock);
		if (dentry->d_parent != parent)
			goto next;
		if (d_unhashed(dentry))
			goto next;

		if (!d_same_name(dentry, parent, name))
			goto next;

		dentry->d_lockref.count++;
		found = dentry;
		spin_unlock(&dentry->d_lock);
		break;
next:
		spin_unlock(&dentry->d_lock);
 	}
 	rcu_read_unlock();

 	return found;
}

This function searches the parent dentry for a child dentry with the given name.

3.3 struct inode

struct inode {
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
}

Usage example 1:

/**
 * d_find_alias - grab a hashed alias of inode
 * @inode: inode in question
 *
 * If inode has a hashed alias, or is a directory and has any alias,
 * acquire the reference to alias and return it. Otherwise return NULL.
 * Notice that if inode is a directory there can be only one alias and
 * it can be unhashed only if it has no children, or if it is the root
 * of a filesystem, or if the directory was renamed and d_revalidate
 * was the first vfs operation to notice.
 *
 * If the inode has an IS_ROOT, DCACHE_DISCONNECTED alias, then prefer
 * any other hashed alias over that one.
 */
struct dentry *d_find_alias(struct inode *inode)
{
	struct dentry *de = NULL;

	if (!hlist_empty(&inode->i_dentry)) {
		spin_lock(&inode->i_lock);
		de = __d_find_alias(inode);
		spin_unlock(&inode->i_lock);
	}
	return de;
}
EXPORT_SYMBOL(d_find_alias);

Usage example 2:

/*
 * When a file is deleted, we have two options:
 * - turn this dentry into a negative dentry
 * - unhash this dentry and free it.
 *
 * Usually, we want to just turn this into
 * a negative dentry, but if anybody else is
 * currently using the dentry or the inode
 * we can't do that and we fall back on removing
 * it from the hash queues and waiting for
 * it to be deleted later when it has no users
 */
 
/**
 * d_delete - delete a dentry
 * @dentry: The dentry to delete
 *
 * Turn the dentry into a negative dentry if possible, otherwise
 * remove it from the hash queues so it can be deleted later
 */
 
void d_delete(struct dentry * dentry)
{
	struct inode *inode = dentry->d_inode;

	spin_lock(&inode->i_lock);
	spin_lock(&dentry->d_lock);
	/*
	 * Are we the only user?
	 */
	if (dentry->d_lockref.count == 1) {
		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
		dentry_unlink_inode(dentry);
	} else {
		__d_drop(dentry);
		spin_unlock(&dentry->d_lock);
		spin_unlock(&inode->i_lock);
	}
}
EXPORT_SYMBOL(d_delete);

Usage example 3:

static struct inode *alloc_inode(struct super_block *sb)
{
	const struct super_operations *ops = sb->s_op;
	struct inode *inode;

	if (ops->alloc_inode)
		inode = ops->alloc_inode(sb);
	else
		inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);

	......
	inode_init_always(sb, inode);
	......
	return inode;
}
/**
 * inode_init_always - perform inode structure initialisation
 * @sb: superblock inode belongs to
 * @inode: inode to initialise
 *
 * These are initializations that need to be done on every inode
 * allocation as the fields are not initialised by slab allocation.
 */
int inode_init_always(struct super_block *sb, struct inode *inode)
{
	spin_lock_init(&inode->i_lock);
}

Usage example 4:

/**
 * do_inode_permission - UNIX permission checking
 * @mnt_userns:	user namespace of the mount the inode was found from
 * @inode:	inode to check permissions on
 * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC ...)
 *
 * We _really_ want to just do "generic_permission()" without
 * even looking at the inode->i_op values. So we keep a cache
 * flag in inode->i_opflags, that says "this has not special
 * permission function, use the fast case".
 */
static inline int do_inode_permission(struct user_namespace *mnt_userns,
				      struct inode *inode, int mask)
{
	if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
		if (likely(inode->i_op->permission))
			return inode->i_op->permission(mnt_userns, inode, mask);

		/* This gets set once for the inode lifetime */
		spin_lock(&inode->i_lock);
		inode->i_opflags |= IOP_FASTPERM;
		spin_unlock(&inode->i_lock);
	}
	return generic_permission(mnt_userns, inode, mask);
}

3.4 struct super_block

struct super_block {
	......
	/* s_inode_list_lock protects s_inodes */
	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
	struct list_head	s_inodes;	/* all inodes */

	spinlock_t		s_inode_wblist_lock;
	struct list_head	s_inodes_wb;	/* writeback inodes */
}

Initialization of spinlock_t s_inode_list_lock and s_inode_wblist_lock:

/**
 *	alloc_super	-	create new superblock
 *	@type:	filesystem type superblock should belong to
 *	@flags: the mount flags
 *	@user_ns: User namespace for the super_block
 *
 *	Allocates and initializes a new &struct super_block.  alloc_super()
 *	returns a pointer new superblock or %NULL if allocation had failed.
 */
static struct super_block *alloc_super(struct file_system_type *type, int flags,
				       struct user_namespace *user_ns)
{
	struct super_block *s = kzalloc(sizeof(struct super_block),  GFP_USER);
	......
	INIT_LIST_HEAD(&s->s_inodes);
	spin_lock_init(&s->s_inode_list_lock);
	INIT_LIST_HEAD(&s->s_inodes_wb);
	spin_lock_init(&s->s_inode_wblist_lock);
	......
}

Use of s_inode_list_lock:

/**
 * inode_sb_list_add - add inode to the superblock list of inodes
 * @inode: inode to add
 */
void inode_sb_list_add(struct inode *inode)
{
	spin_lock(&inode->i_sb->s_inode_list_lock);
	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
	spin_unlock(&inode->i_sb->s_inode_list_lock);
}
EXPORT_SYMBOL_GPL(inode_sb_list_add);

static inline void inode_sb_list_del(struct inode *inode)
{
	if (!list_empty(&inode->i_sb_list)) {
		spin_lock(&inode->i_sb->s_inode_list_lock);
		list_del_init(&inode->i_sb_list);
		spin_unlock(&inode->i_sb->s_inode_list_lock);
	}
}

/**
 * invalidate_inodes	- attempt to free all inodes on a superblock
 * @sb:		superblock to operate on
 * @kill_dirty: flag to guide handling of dirty inodes
 *
 * Attempts to free all inodes for a given superblock.  If there were any
 * busy inodes return a non-zero value, else zero.
 * If @kill_dirty is set, discard dirty inodes too, otherwise treat
 * them as busy.
 */
int invalidate_inodes(struct super_block *sb, bool kill_dirty)
{
	int busy = 0;
	struct inode *inode, *next;
	LIST_HEAD(dispose);

again:
	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		if (inode->i_state & I_DIRTY_ALL && !kill_dirty) {
			spin_unlock(&inode->i_lock);
			busy = 1;
			continue;
		}
		if (atomic_read(&inode->i_count)) {
			spin_unlock(&inode->i_lock);
			busy = 1;
			continue;
		}

		inode->i_state |= I_FREEING;
		inode_lru_list_del(inode);
		spin_unlock(&inode->i_lock);
		list_add(&inode->i_lru, &dispose);
		if (need_resched()) {
			spin_unlock(&sb->s_inode_list_lock);
			cond_resched();
			dispose_list(&dispose);
			goto again;
		}
	}
	spin_unlock(&sb->s_inode_list_lock);

	dispose_list(&dispose);

	return busy;
}
/**
 * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
 * @sb: superblock being unmounted.
 *
 * Called during unmount with no locks held, so needs to be safe against
 * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
 */
static void fsnotify_unmount_inodes(struct super_block *sb)
{
	struct inode *inode, *iput_inode = NULL;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		/*
		 * We cannot __iget() an inode in state I_FREEING,
		 * I_WILL_FREE, or I_NEW which is fine because by that point
		 * the inode cannot have any associated watches.
		 */
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}

		/*
		 * If i_count is zero, the inode cannot have any watches and
		 * doing an __iget/iput with SB_ACTIVE clear would actually
		 * evict all inodes with zero i_count from icache which is
		 * unnecessarily violent and may in fact be illegal to do.
		 * However, we should have been called /after/ evict_inodes
		 * removed all zero refcount inodes, in any case.  Test to
		 * be sure.
		 */
		if (!atomic_read(&inode->i_count)) {
			spin_unlock(&inode->i_lock);
			continue;
		}

		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		if (iput_inode)
			iput(iput_inode);

		/* for each watch, send FS_UNMOUNT and then remove it */
		fsnotify_inode(inode, FS_UNMOUNT);

		fsnotify_inode_delete(inode);

		iput_inode = inode;

		cond_resched();
		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);

	if (iput_inode)
		iput(iput_inode);
}

Use of s_inode_wblist_lock:

// linux-5.15/fs/fs-writeback.c

/*
 * mark an inode as under writeback on the sb
 */
void sb_mark_inode_writeback(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	unsigned long flags;

	if (list_empty(&inode->i_wb_list)) {
		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
		if (list_empty(&inode->i_wb_list)) {
			list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
			trace_sb_mark_inode_writeback(inode);
		}
		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
	}
}

This function marks the given inode as under writeback and adds it to the superblock's writeback list. It first checks whether the inode's writeback list entry is empty; if it is, it takes the superblock's writeback-list lock and checks again to avoid adding the inode twice. If the entry is still empty, the inode is added to the superblock's writeback list and a tracepoint records the operation.

// linux-5.15/fs/fs-writeback.c

/*
 * clear an inode as under writeback on the sb
 */
void sb_clear_inode_writeback(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	unsigned long flags;

	if (!list_empty(&inode->i_wb_list)) {
		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
		if (!list_empty(&inode->i_wb_list)) {
			list_del_init(&inode->i_wb_list);
			trace_sb_clear_inode_writeback(inode);
		}
		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
	}
}

This function removes the given inode from the superblock's writeback list, indicating that the inode is no longer under writeback. It first checks whether the inode's writeback list entry is non-empty; if it is, it takes the superblock's writeback-list lock and checks again for correctness. If the entry is still non-empty, the inode is removed from the superblock's writeback list and a tracepoint records that the writeback state was cleared.

// linux-5.15/fs/fs-writeback.c

/*
 * The @s_sync_lock is used to serialise concurrent sync operations
 * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
 * Concurrent callers will block on the s_sync_lock rather than doing contending
 * walks. The queueing maintains sync(2) required behaviour as all the IO that
 * has been issued up to the time this function is enter is guaranteed to be
 * completed by the time we have gained the lock and waited for all IO that is
 * in progress regardless of the order callers are granted the lock.
 */
static void wait_sb_inodes(struct super_block *sb)
{
	LIST_HEAD(sync_list);

	/*
	 * We need to be protected against the filesystem going from
	 * r/o to r/w or vice versa.
	 */
	WARN_ON(!rwsem_is_locked(&sb->s_umount));

	mutex_lock(&sb->s_sync_lock);

	/*
	 * Splice the writeback list onto a temporary list to avoid waiting on
	 * inodes that have started writeback after this point.
	 *
	 * Use rcu_read_lock() to keep the inodes around until we have a
	 * reference. s_inode_wblist_lock protects sb->s_inodes_wb as well as
	 * the local list because inodes can be dropped from either by writeback
	 * completion.
	 */
	rcu_read_lock();
	spin_lock_irq(&sb->s_inode_wblist_lock);
	list_splice_init(&sb->s_inodes_wb, &sync_list);

	/*
	 * Data integrity sync. Must wait for all pages under writeback, because
	 * there may have been pages dirtied before our sync call, but which had
	 * writeout started before we write it out.  In which case, the inode
	 * may not be on the dirty list, but we still have to wait for that
	 * writeout.
	 */
	while (!list_empty(&sync_list)) {
		struct inode *inode = list_first_entry(&sync_list, struct inode,
						       i_wb_list);
		struct address_space *mapping = inode->i_mapping;

		/*
		 * Move each inode back to the wb list before we drop the lock
		 * to preserve consistency between i_wb_list and the mapping
		 * writeback tag. Writeback completion is responsible to remove
		 * the inode from either list once the writeback tag is cleared.
		 */
		list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);

		/*
		 * The mapping can appear untagged while still on-list since we
		 * do not have the mapping lock. Skip it here, wb completion
		 * will remove it.
		 */
		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
			continue;

		spin_unlock_irq(&sb->s_inode_wblist_lock);

		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
			spin_unlock(&inode->i_lock);

			spin_lock_irq(&sb->s_inode_wblist_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();

		/*
		 * We keep the error status of individual mapping so that
		 * applications can catch the writeback error using fsync(2).
		 * See filemap_fdatawait_keep_errors() for details.
		 */
		filemap_fdatawait_keep_errors(mapping);

		cond_resched();

		iput(inode);

		rcu_read_lock();
		spin_lock_irq(&sb->s_inode_wblist_lock);
	}
	spin_unlock_irq(&sb->s_inode_wblist_lock);
	rcu_read_unlock();
	mutex_unlock(&sb->s_sync_lock);
}

This function waits for writeback of all inodes on the given superblock to complete. It walks the superblock's writeback list and waits for every page under writeback to be written out, which guarantees that all I/O issued before the function was entered has completed by the time it returns.

References

Linux 2.6.20
Linux 5.15.0

https://medium.com/geekculture/the-linux-kernel-locking-api-and-shared-objects-1169c2ae88ff

