Linux Kernel Spinlocks (spinlock), Part 1
Table of contents
- Preface
- 1. Spinlocks
- 1.1 Overview
- 1.2 API
- 1.2.1 spin_lock/spin_unlock
- 1.2.2 spin_lock_irq/spin_unlock_irq
- 1.2.3 spin_lock_irqsave/spin_unlock_irqrestore
- 1.2.4 spin_lock_bh/spin_unlock_bh
- 1.2.5 Additional notes
- 2. How spinlocks work
- 3. Spinlocks in the kernel
- 3.1 struct file
- 3.2 struct dentry
- 3.3 struct inode
- 3.4 struct super_block
- References
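Preface
In software engineering, a spinlock is a synchronization primitive that protects shared resources in multithreaded programs. When a thread tries to acquire a spinlock that another thread already holds, it does not go to sleep (as it would with a conventional mutex). Instead it "spins" in a loop, repeatedly checking whether the lock has become available. This behavior is called busy waiting.
Because spinning avoids the cost of rescheduling and context switching, spinlocks are very efficient when threads are only expected to block briefly. For this reason operating system kernels use them extensively.
In an operating system kernel:
On an SMP system several CPUs can be in kernel mode at the same time and can, in principle, operate on every existing data structure. To keep the CPUs from interfering with one another, certain regions of the kernel must be protected by locks, which ensure that only one CPU at a time accesses the protected region.
The kernel has unrestricted access to the whole address space. On multiprocessor systems (and, similarly, on uniprocessor systems with kernel preemption enabled) this causes problems: if several processors are in kernel mode simultaneously, they can in principle access the same data structure at the same time.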
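1. Spinlocks
1.1 Overview
When access to a resource is contended, the kernel can choose between two kinds of lock:
(1) wait in place (busy-wait);
(2) suspend the current process and schedule another one (sleep).
A spinlock waits in place until the lock is released.
When a lock is needed in interrupt context, a spinlock is the first choice, because interrupt context must not sleep.
Spinlocks are the most commonly used locking option. They protect a section of code for a short time against access by other processors. While waiting for a spinlock to be released, the kernel repeatedly checks whether it can take the lock and never goes to sleep (busy waiting). This is clearly inefficient if the wait is long.
Spinlocks are therefore meant for short code sections that contain only a few C statements and complete quickly. Most kernel data structures have their own spinlock, which must be held while the structure's critical members are manipulated. Spinlocks are ubiquitous in the kernel source. Consider the following scenario: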
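When CPU B, running task B, calls the spinlock's lock function while the spinlock is already held by another CPU (say CPU A, running task A), CPU B simply spins in a loop, blocking task B, until the other CPU releases the lock (task A calls the unlock function). Spinning of this kind only happens on multi-core machines, since it takes more than one CPU: on a single-core machine a task either holds the spinlock and keeps running, or cannot run until the lock has been released. A spinlock is a lock held by a CPU, in contrast to a mutex, which is held by a task. A spinlock works by disabling the scheduler on the local CPU (the CPU running the task that called the lock API). The task currently running on that CPU therefore cannot be preempted by another task; it can only be interrupted by IRQs, and only if they have not been disabled on the local CPU (more on this later). In other words, a spinlock protects a resource that only one CPU may take or access at a time. This makes spinlocks suitable for SMP safety and for executing atomic tasks.
Spinlocks are not the only mechanism that relies on hardware atomic operations. In the Linux kernel, for example, the preemption state is kept in a per-CPU variable: 0 means preemption is enabled, and any value greater than 0 means it is disabled (schedule() becomes ineffective). Disabling preemption (preempt_disable()) adds 1 to the current CPU's variable (preempt_count), while preempt_enable() subtracts 1, checks whether the new value is 0 and, if so, calls schedule(). These additions and subtractions must be atomic, so they depend on the CPU providing atomic add/subtract operations.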
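The counting behavior described above can be sketched in user space. This is only a model under stated assumptions: every name ending in _model is hypothetical, the counter is a plain global instead of a per-CPU variable, and "scheduling" is just a counter increment. It shows that nested disable/enable pairs only reach the scheduler when the count drops back to 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins; the real counter is per-CPU. */
static int preempt_count_model;     /* 0 means preemption is enabled */
static bool need_resched_model;     /* set when a reschedule is pending */
static int schedule_calls;          /* how many times we "rescheduled" */

static void schedule_model(void)
{
    need_resched_model = false;
    schedule_calls++;
}

static void preempt_disable_model(void)
{
    preempt_count_model++;
}

static void preempt_enable_model(void)
{
    assert(preempt_count_model > 0);    /* enable must pair with a disable */
    if (--preempt_count_model == 0 && need_resched_model)
        schedule_model();
}

/* Nest two disable/enable pairs; encodes the scheduler calls seen at each step. */
static int preempt_nesting_demo(void)
{
    preempt_disable_model();            /* count: 1 */
    preempt_disable_model();            /* count: 2 */
    need_resched_model = true;          /* e.g. a wakeup happened meanwhile */
    preempt_enable_model();             /* count: 1, still disabled */
    int after_inner = schedule_calls;   /* no scheduler call yet */
    preempt_enable_model();             /* count: 0, scheduler runs */
    return after_inner * 10 + schedule_calls;
}
```

In the model, only the outermost preempt_enable_model() call reaches schedule_model(), which mirrors why holding several spinlocks keeps preemption off until the last one is released.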
1.2 API
1.2.1 spin_lock/spin_unlock
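The most commonly used pair is spin_lock()/spin_unlock(). Usage:
Dynamic initialization:
spinlock_t lock; // declare a spinlock variable
spin_lock_init(&lock); // initialize it at run time
Static definition:
DEFINE_SPINLOCK(lock);
Usage:
spin_lock(&lock); // acquire the lock
// critical section
spin_unlock(&lock); // release the lock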
// include/linux/spinlock_types.h
#define DEFINE_SPINLOCK(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x)
// include/linux/spinlock.h
# define spin_lock_init(_lock) \
do { \
spinlock_check(_lock); \
*(_lock) = __SPIN_LOCK_UNLOCKED(_lock); \
} while (0)
// include/linux/spinlock.h
#define raw_spin_lock(lock) _raw_spin_lock(lock)
static __always_inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(&lock->rlock);
}
// kernel/locking/spinlock.c
void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
// /include/linux/spinlock_api_smp.h
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
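preempt_disable(): disables preemption by incrementing the preemption count, so the task running on the current CPU cannot be preempted.
spin_acquire(): a lockdep annotation recording that the spinlock is being acquired; it compiles away when lock debugging is disabled.
LOCK_CONTENDED(): handles the case where the lock may already be held. With lock statistics (CONFIG_LOCK_STAT) enabled it first calls do_raw_spin_trylock() and, only if that fails, records the contention and calls do_raw_spin_lock(); otherwise it calls do_raw_spin_lock() directly.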
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
__acquire(lock);
arch_spin_lock(&lock->raw_lock);
mmiowb_spin_lock();
}
// /include/asm-generic/qspinlock.h
#define arch_spin_lock(l) queued_spin_lock(l)
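The resulting call chain is:
spin_lock()
-->raw_spin_lock()
-->_raw_spin_lock()
-->__raw_spin_lock()
-->do_raw_spin_lock()
-->arch_spin_lock()
-->queued_spin_lock()
Using a spinlock this way has a limitation. Although a spinlock prevents preemption on the local CPU, it does not prevent that CPU from being taken over by an interrupt (that is, from running an interrupt handler). Imagine a CPU holding a spinlock that protects some resource when an interrupt arrives. The CPU stops the current task and branches to the interrupt handler. So far, so good. Now imagine that the handler needs the same spinlock (as you may have guessed, the resource is shared with the interrupt handler). It will spin forever, trying to take a lock held by the very task it interrupted. This situation is a deadlock.
To solve this problem the kernel provides the _irq variants of the spinlock functions, which disable/enable interrupts on the local CPU in addition to disabling/enabling preemption: spin_lock_irq() and spin_unlock_irq(). With interrupts off, the handler cannot run on this CPU while the lock is held, so this deadlock cannot occur.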
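The deadlock scenario can be modeled on a single simulated CPU. The sketch below is a user-space illustration only: lock_held, irqs_enabled, and all function names are hypothetical, and the "handler" merely reports whether a real spin_lock() call would spin forever instead of actually spinning.

```c
#include <assert.h>
#include <stdbool.h>

static bool lock_held;              /* models the spinlock's state on one CPU */
static bool irqs_enabled = true;    /* models the local interrupt-enable flag */

/* The "interrupt handler" needs the same lock as the interrupted code. */
static bool handler_would_deadlock(void)
{
    /* A real spin_lock() here would spin forever: the holder cannot run
     * again until the handler returns. */
    return lock_held;
}

/* Deliver a simulated interrupt; returns true if the handler deadlocks. */
static bool deliver_irq(void)
{
    if (!irqs_enabled)
        return false;               /* interrupt stays pending, handler does not run */
    return handler_would_deadlock();
}

static bool plain_lock_scenario(void)
{
    lock_held = true;               /* spin_lock() */
    bool deadlock = deliver_irq();  /* interrupt arrives inside the critical section */
    lock_held = false;              /* spin_unlock() */
    return deadlock;
}

static bool irq_lock_scenario(void)
{
    assert(irqs_enabled);           /* the _irq variant assumes interrupts were on */
    irqs_enabled = false;           /* spin_lock_irq(): disable interrupts ... */
    lock_held = true;               /* ... then take the lock */
    bool deadlock = deliver_irq();  /* the interrupt is held off */
    lock_held = false;              /* spin_unlock_irq(): release ... */
    irqs_enabled = true;            /* ... then enable interrupts again */
    return deadlock;
}
```

In this model plain_lock_scenario() reports a deadlock and irq_lock_scenario() does not, because the handler never runs while the lock is held.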
1.2.2 spin_lock_irq/spin_unlock_irq
#define raw_spin_lock_irq(lock) _raw_spin_lock_irq(lock)
static __always_inline void spin_lock_irq(spinlock_t *lock)
{
raw_spin_lock_irq(&lock->rlock);
}
void __lockfunc _raw_spin_lock_irq(raw_spinlock_t *lock)
{
__raw_spin_lock_irq(lock);
}
EXPORT_SYMBOL(_raw_spin_lock_irq);
static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
{
local_irq_disable();
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
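local_irq_disable(): disables interrupts on the local CPU, so no interrupt handler can run on it while the spinlock is held.
preempt_disable(): disables preemption by incrementing the preemption count, so the task running on the current CPU cannot be preempted.
spin_acquire(): a lockdep annotation recording that the spinlock is being acquired.
LOCK_CONTENDED(): handles the case where the lock may already be held, calling do_raw_spin_trylock() and/or do_raw_spin_lock() depending on the kernel configuration.
The _irq variants solve the problem only partially. Suppose interrupts were already disabled on the processor before your code took the lock. When you call spin_unlock_irq(), it not only releases the lock but also enables interrupts, possibly wrongly, because spin_unlock_irq() cannot know which interrupts were enabled before the lock was taken and which were not.
A short example:
(1) Suppose that before the spinlock is acquired, interrupts x and y are disabled and z is not.
(2) spin_lock_irq() disables interrupts (x, y, and z are now all disabled) and acquires the lock.
(3) spin_unlock_irq() enables interrupts. x, y, and z are now all enabled, which was not the case before the lock was acquired. This is the problem.
This makes spin_lock_irq() unsafe when it is called from a context in which IRQs are already off, because its counterpart spin_unlock_irq() blindly enables IRQs, at the risk of enabling ones that were not enabled when spin_lock_irq() was called. Using spin_lock_irq() only makes sense when you know that interrupts are enabled, that is, when you are sure nothing else has disabled interrupts on the local CPU.
Now imagine saving the interrupt state in a variable before acquiring the lock and restoring exactly that state when releasing it. The problem disappears. For this purpose the kernel provides the _irqsave variants, which behave like the _irq ones but additionally save and restore the interrupt state: spin_lock_irqsave() and spin_unlock_irqrestore().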
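The difference between the two variants can be shown with a small user-space model. Everything here is an assumption made for illustration: the interrupt state is reduced to one boolean, the lock itself is left out, and the *_model helpers are hypothetical names, not kernel functions. Both scenarios start with interrupts already disabled by the caller.

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_enabled = true;    /* models the local CPU's interrupt-enable flag */

static void lock_irq_model(void)
{
    irqs_enabled = false;           /* like local_irq_disable() */
}

static void unlock_irq_model(void)
{
    irqs_enabled = true;            /* like local_irq_enable(): unconditional */
}

static unsigned long lock_irqsave_model(void)
{
    unsigned long flags = irqs_enabled ? 1UL : 0UL;  /* like local_irq_save(flags) */
    irqs_enabled = false;
    return flags;
}

static void unlock_irqrestore_model(unsigned long flags)
{
    assert(!irqs_enabled);          /* interrupts stay off while the lock is held */
    irqs_enabled = (flags != 0);    /* like local_irq_restore(flags) */
}

/* Interrupts are already off on entry; return their state after the pair. */
static bool state_after_irq_pair(void)
{
    irqs_enabled = false;           /* the caller disabled interrupts earlier */
    lock_irq_model();
    unlock_irq_model();
    return irqs_enabled;            /* interrupts were switched on behind the caller's back */
}

static bool state_after_irqsave_pair(void)
{
    irqs_enabled = false;           /* the caller disabled interrupts earlier */
    unsigned long flags = lock_irqsave_model();
    unlock_irqrestore_model(flags);
    return irqs_enabled;            /* the caller's state is preserved */
}
```

The _irq pair leaves interrupts enabled even though the caller had them disabled, while the _irqsave pair puts back whatever state it found.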
1.2.3 spin_lock_irqsave/spin_unlock_irqrestore
#define spin_lock_irqsave(lock, flags) \
do { \
raw_spin_lock_irqsave(spinlock_check(lock), flags); \
} while (0)
#define raw_spin_lock_irqsave(lock, flags) \
do { \
typecheck(unsigned long, flags); \
flags = _raw_spin_lock_irqsave(lock); \
} while (0)
unsigned long __lockfunc _raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
return __raw_spin_lock_irqsave(lock);
}
EXPORT_SYMBOL(_raw_spin_lock_irqsave);
static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
unsigned long flags;
local_irq_save(flags);
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
/*
* On lockdep we dont want the hand-coded irq-enable of
* do_raw_spin_lock_flags() code, because lockdep assumes
* that interrupts are not re-enabled during lock-acquire:
*/
#ifdef CONFIG_LOCKDEP
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
#else
do_raw_spin_lock_flags(lock, &flags);
#endif
return flags;
}
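local_irq_save(): saves the current interrupt state of the local CPU in flags, then disables interrupts.
preempt_disable(): disables preemption by incrementing the preemption count, so the task running on the current CPU cannot be preempted.
spin_acquire(): a lockdep annotation recording that the spinlock is being acquired.
LOCK_CONTENDED(): handles the case where the lock may already be held, calling do_raw_spin_trylock() and/or do_raw_spin_lock() depending on the kernel configuration.
Note:
Any use of spin_lock_irq()/spin_unlock_irq() can be replaced by spin_lock_irqsave()/spin_unlock_irqrestore(); which one to choose depends on the situation. If interrupts should always be enabled once the critical section ends, use spin_lock_irq()/spin_unlock_irq(). If the interrupt state from before the critical section should be restored instead, use spin_lock_irqsave()/spin_unlock_irqrestore(): for example, if interrupt A was deliberately disabled beforehand, it should still be disabled afterwards.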
1.2.4 spin_lock_bh/spin_unlock_bh
#define raw_spin_lock_bh(lock) _raw_spin_lock_bh(lock)
static __always_inline void spin_lock_bh(spinlock_t *lock)
{
raw_spin_lock_bh(&lock->rlock);
}
void __lockfunc _raw_spin_lock_bh(raw_spinlock_t *lock)
{
__raw_spin_lock_bh(lock);
}
EXPORT_SYMBOL(_raw_spin_lock_bh);
static inline void __raw_spin_lock_bh(raw_spinlock_t *lock)
{
__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
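__local_bh_disable_ip(): disables softirqs (bottom halves) on the local CPU. Because SOFTIRQ_LOCK_OFFSET also includes the preemption lock offset, this call raises the preemption count as well, which is why no separate preempt_disable() appears here.
spin_acquire(): a lockdep annotation recording that the spinlock is being acquired.
LOCK_CONTENDED(): handles the case where the lock may already be held, calling do_raw_spin_trylock() and/or do_raw_spin_lock() depending on the kernel configuration.
The spin_lock_bh()/spin_unlock_bh() pair disables the interrupt bottom halves, that is, softirqs, while the lock is held.
1.2.5 Additional notes
spin_lock() and all its variants automatically call preempt_disable(), which disables preemption on the local CPU. Conversely, spin_unlock() and its variants call preempt_enable(), which tries to enable preemption (yes, tries: it depends on whether other spinlocks are still held, which affects the preemption counter) and, if preemption does become enabled (the counter has dropped to 0), calls schedule() internally. spin_unlock() is therefore a preemption point and may re-enable preemption.
Disabling interrupts may prevent kernel preemption (the scheduler's timer interrupt is disabled), but it does not stop the protected section from invoking the scheduler (the schedule() function). Many kernel functions call the scheduler indirectly, including those that deal with spinlocks. Even a simple printk() may therefore invoke the scheduler, because it takes the spinlock that protects the kernel message buffer. The kernel disables or enables the scheduler (preemption) by incrementing or decrementing a kernel-wide, per-CPU variable called preempt_count (0 by default, meaning "enabled"). When this variable is greater than 0 (schedule() checks it), the scheduler simply returns and does nothing. Every spin_lock* helper increments the variable by 1; every spin_unlock* function decrements it by 1, and whenever it reaches 0 the scheduler is invoked, which means your critical section may not be atomic.
Disabling interrupts alone therefore protects code from preemption only if the protected code does not trigger preemption itself. That said, code that holds a spinlock must not sleep, because there would be no way to wake it up (remember that both the timer interrupt and the scheduler are disabled on the local CPU).
2. How spinlocks work
A spinlock is a hardware-based locking primitive. It relies on the hardware's ability to provide atomic operations (such as test_and_set, which in a non-atomic implementation would be a separate read, modify, and write).
In general a spinlock can only be implemented with special machine instructions, such as an atomic (uninterruptible) test-and-set, and cannot easily be implemented in a programming language that lacks true atomic operations.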
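As a rough illustration of the test-and-set idea, here is a minimal user-space spinlock built on C11 atomic_flag, whose test-and-set operation is guaranteed to be atomic. This is a sketch and not kernel code: tas_lock_t and the tas_* functions are names invented for this example, and the demo runs in a single thread purely to walk through the lock states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag busy;               /* set while the lock is held */
} tas_lock_t;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* One atomic test-and-set: true if we changed the flag from clear to set. */
static bool tas_trylock(tas_lock_t *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->busy, memory_order_acquire);
}

static void tas_lock(tas_lock_t *lock)
{
    while (!tas_trylock(lock))
        ;                           /* busy-wait until the holder clears the flag */
}

static void tas_unlock(tas_lock_t *lock)
{
    atomic_flag_clear_explicit(&lock->busy, memory_order_release);
}

/* Single-threaded walk through the lock states; returns the failed attempts. */
static int tas_demo(void)
{
    tas_lock_t lock = TAS_LOCK_INIT;
    int failed = 0;

    tas_lock(&lock);                /* free -> held */
    if (!tas_trylock(&lock))
        failed++;                   /* held: a second CPU would keep spinning here */
    tas_unlock(&lock);              /* held -> free */

    bool reacquired = tas_trylock(&lock);
    assert(reacquired);             /* the lock is free again */
    tas_unlock(&lock);
    return failed;
}
```

If the test and the set were two separate steps, two CPUs could both see the flag clear and both believe they own the lock; doing them as one atomic operation is what makes the lock correct.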
Linux has had three generations of spinlock implementation:
(1) Spinlocks in old kernels (before 2.6.25)
typedef struct {
volatile unsigned int slock;
} raw_spinlock_t;
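raw_spinlock_t needs only a single unsigned int, slock. In this implementation slock equal to 1 means unlocked, and a value of 0 or less means locked.
When the lock is free (slock is 1), a thread takes it by decrementing the value to 0. Other threads then see a value that is not positive and must wait. When the holder unlocks, it sets the variable back to 1, and another thread can take the lock. The encoding could just as well be reversed (for example 0 for unlocked and 1 for locked); the value is only a state marker.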
// v2.6.20/source/include/linux/spinlock.h
#define spin_lock(lock) _spin_lock(lock)
// v2.6.20/source/kernel/spinlock.c
void __lockfunc _spin_lock(spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_spin_lock(lock);
}
// v2.6.20/source/include/linux/spinlock.h
# define _raw_spin_lock(lock) __raw_spin_lock(&(lock)->raw_lock)
// v2.6.20/source/include/asm-x86_64/spinlock.h
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
asm volatile(
"\n1:\t"
LOCK_PREFIX " ; decl %0\n\t"
"jns 2f\n"
"3:\n"
"rep;nop\n\t"
"cmpl $0,%0\n\t"
"jle 3b\n\t"
"jmp 1b\n"
"2:\t" : "=m" (lock->slock) : : "memory");
}
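(1) LOCK_PREFIX is a macro that emits the x86 lock instruction prefix on SMP builds, making the following read-modify-write instruction atomic across CPUs.
(2) decl %0 decrements the variable that lock points to (lock->slock). If the result is not negative (tested by jns), the spinlock has been acquired.
(3) If the result is negative, the code waits (spins) at label 3:, executing rep;nop (the encoding of the pause hint).
(4) Inside the wait loop the code keeps checking the lock value and continues waiting as long as it is less than or equal to 0.
(5) Once the value is greater than 0, jmp 1b goes back to step 1 to try to acquire the lock again.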
// v2.6.20/source/include/asm-x86_64/spinlock.h
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
}
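(1) movl $1,%0 stores the value 1 into lock->slock, which marks the spinlock as released.
(2) "=m" (lock->slock) declares lock->slock as a memory output operand, so the store goes directly to the lock word in memory.
(3) The "memory" clobber is a compiler barrier: it tells the compiler that the asm statement modifies memory, so memory accesses from the critical section are not kept in registers or moved across the unlock.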
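The lock and unlock sequences above can be restated with C11 atomics. This is a hedged user-space port for reading purposes, not the kernel's code: old_spinlock_t and the old_spin_* names are invented here, and old_spin_trylock() is an extra helper added so that the held state can be observed without spinning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int slock;               /* 1 = unlocked, 0 or negative = locked */
} old_spinlock_t;

#define OLD_SPINLOCK_UNLOCKED { 1 }

static void old_spin_lock(old_spinlock_t *lock)
{
    for (;;) {
        /* "lock; decl" + "jns": atomic decrement, success if the result is >= 0 */
        if (atomic_fetch_sub(&lock->slock, 1) - 1 >= 0)
            return;
        /* "rep;nop" + "cmpl" + "jle": read-only wait while the value is <= 0 */
        while (atomic_load(&lock->slock) <= 0)
            ;
        /* "jmp 1b": the lock looks free, retry the decrement */
    }
}

/* Hypothetical helper: swap in 0 and succeed only if the old value was positive. */
static bool old_spin_trylock(old_spinlock_t *lock)
{
    return atomic_exchange(&lock->slock, 0) > 0;
}

static void old_spin_unlock(old_spinlock_t *lock)
{
    assert(atomic_load(&lock->slock) <= 0);  /* only a held lock may be released */
    atomic_store(&lock->slock, 1);           /* "movl $1": mark unlocked */
}
```

Note that waiters decrement the value below 0 while they queue up, which is why the unlock path stores 1 instead of incrementing.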
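This implementation works and performs well when there is no contention, but it has one problem: it is unfair. All threads compete for the lock in no particular order, and whoever grabs it first gets it, regardless of whether a thread has been waiting for a long time or has only just started spinning. With little contention the unfairness hardly matters. As hardware evolved, however, the number of cores kept growing and contention between them became more intense, and unordered competition for the spinlock turned into a performance problem.
(2) Spinlocks from kernel 2.6.25 onwards
From 2.6.25 on, the kernel's spinlock is the ticket spinlock, a queueing spinlock that serves waiters in FIFO order.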
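The article only names the ticket algorithm, so the following is a generic textbook-style sketch in C11 atomics and not the kernel's implementation; ticket_lock_t, next, and owner are names chosen for this example. Each arriving thread draws a ticket, and the lock serves tickets strictly in order, which is what gives FIFO fairness.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint next;               /* next ticket to hand out */
    atomic_uint owner;              /* ticket currently being served */
} ticket_lock_t;

/* Take a ticket and wait until it is being served; returns the ticket number. */
static unsigned int ticket_lock(ticket_lock_t *lock)
{
    unsigned int my_ticket = atomic_fetch_add(&lock->next, 1);

    while (atomic_load(&lock->owner) != my_ticket)
        ;                           /* spin until it is our turn */
    return my_ticket;
}

static void ticket_unlock(ticket_lock_t *lock)
{
    /* The lock is held exactly when a ticket has been issued but not yet retired. */
    assert(atomic_load(&lock->owner) != atomic_load(&lock->next));
    atomic_fetch_add(&lock->owner, 1);   /* serve the next waiter in line */
}
```

A thread that starts waiting earlier always holds a smaller ticket, so a late arrival can no longer overtake it as it could with the old decrement-based lock.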
(3) Spinlocks from kernel 4.2.0 onwards
From 4.2.0 on, the kernel's spinlock is the queued spinlock, a queueing spinlock based on the MCS algorithm.
Current kernel versions all use the queued spinlock.
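For orientation, a classic MCS lock can be sketched as follows. This is the textbook algorithm written with C11 atomics under my own naming (mcs_lock, mcs_node, mcs_lock_acquire, mcs_lock_release); the kernel's queued spinlock builds on the same idea but is a more compact variation that is not reproduced here. Each waiter spins on a flag in its own node, so waiters do not all hammer the same memory location.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;    /* successor in the wait queue */
    atomic_bool locked;                 /* true while this waiter must spin */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail;    /* last waiter, NULL when the lock is free */
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);

    /* Append ourselves to the queue in one atomic step. */
    struct mcs_node *prev = atomic_exchange(&lock->tail, me);
    if (prev == NULL)
        return;                         /* queue was empty: lock acquired */

    atomic_store(&prev->next, me);      /* link behind the previous waiter */
    while (atomic_load(&me->locked))
        ;                               /* spin on our own node only */
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *me)
{
    assert(atomic_load(&lock->tail) != NULL);   /* the lock must be held */

    struct mcs_node *next = atomic_load(&me->next);
    if (next == NULL) {
        struct mcs_node *expected = me;
        /* No visible successor: try to mark the lock free. */
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
            return;
        /* A successor is enqueueing; wait until it has linked itself. */
        while ((next = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&next->locked, false); /* hand the lock to the successor */
}
```

Each caller supplies its own node, typically on its stack, and the node must stay valid until the matching release has returned.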
3. Spinlocks in the kernel
The kernel source below is taken from Linux 5.15.0, which uses queued spinlocks.
3.1 struct file
struct file {
/*
* Protects f_ep, f_flags.
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
......
unsigned int f_flags;
fmode_t f_mode;
......
};
Initialization of spinlock_t f_lock:
/**
* alloc_file - allocate and initialize a 'struct file'
*
* @path: the (dentry, vfsmount) pair for the new file
* @flags: O_... flags with which the new file will be opened
* @fop: the 'struct file_operations' for the new file
*/
static struct file *alloc_file(const struct path *path, int flags,
const struct file_operations *fop)
{
struct file *file;
file = alloc_empty_file(flags, current_cred());
file->f_path = *path;
file->f_inode = path->dentry->d_inode;
file->f_mapping = path->dentry->d_inode->i_mapping;
......
file->f_mode |= FMODE_OPENED;
file->f_op = fop;
......
return file;
}
alloc_file()
-->alloc_empty_file()
-->__alloc_file()
static struct file *__alloc_file(int flags, const struct cred *cred)
{
struct file *f;
f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
spin_lock_init(&f->f_lock);
mutex_init(&f->f_pos_lock);
f->f_flags = flags;
f->f_mode = OPEN_FMODE(flags);
return f;
}
Use of spinlock_t f_lock:
static int setfl(int fd, struct file * filp, unsigned long arg)
{
......
spin_lock(&filp->f_lock);
filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
spin_unlock(&filp->f_lock);
......
}
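Here the lock protects the f_flags member of struct file: the update is a read-modify-write, so it must not interleave with another update of the same field.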
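The masked update inside the critical section can be tried out in isolation. The sketch below assumes a made-up mask value (DEMO_SETFL_MASK) and a made-up function name (apply_setfl); only the expression itself follows the setfl() excerpt above.

```c
#include <assert.h>

/* Hypothetical mask: only the low four bits may be changed by the caller. */
#define DEMO_SETFL_MASK 0x0fu

/* Replace the maskable bits with those from arg, keep all other bits. */
static unsigned int apply_setfl(unsigned int old_flags, unsigned int arg)
{
    unsigned int new_flags =
        (arg & DEMO_SETFL_MASK) | (old_flags & ~DEMO_SETFL_MASK);

    /* Bits outside the mask are carried over unchanged. */
    assert((new_flags & ~DEMO_SETFL_MASK) == (old_flags & ~DEMO_SETFL_MASK));
    return new_flags;
}
```

With this mask, apply_setfl(0xa5, 0xff) yields 0xaf: the low nibble comes from the argument and the high nibble from the old value. Because the new value depends on the old one, two unsynchronized updates could lose each other's changes, which is what the spinlock prevents.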
int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
{
switch (advice) {
case POSIX_FADV_NORMAL:
file->f_ra.ra_pages = bdi->ra_pages;
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);
break;
case POSIX_FADV_RANDOM:
spin_lock(&file->f_lock);
file->f_mode |= FMODE_RANDOM;
spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
file->f_ra.ra_pages = bdi->ra_pages * 2;
spin_lock(&file->f_lock);
file->f_mode &= ~FMODE_RANDOM;
spin_unlock(&file->f_lock);
break;
......
}
}
Here the lock protects the f_mode member of struct file.
3.2 struct dentry
#define USE_CMPXCHG_LOCKREF \
(IS_ENABLED(CONFIG_ARCH_USE_CMPXCHG_LOCKREF) && \
IS_ENABLED(CONFIG_SMP) && SPINLOCK_SIZE <= 4)
struct lockref {
union {
#if USE_CMPXCHG_LOCKREF
aligned_u64 lock_count;
#endif
struct {
spinlock_t lock;
int count;
};
};
};
#define d_lock d_lockref.lock
struct dentry {
/* Ref lookup also touches following */
struct lockref d_lockref; /* per-dentry lock and refcount */
};
/*
* dentry->d_lock spinlock nesting subclasses:
*
* 0: normal
* 1: nested
*/
enum dentry_d_lock_class
{
DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
DENTRY_D_LOCK_NESTED
};
Usage example 1:
/**
* d_alloc - allocate a dcache entry
* @parent: parent of entry to allocate
* @name: qstr of the name
*
* Allocates a dentry. It returns %NULL if there is insufficient memory
* available. On a success the dentry is returned. The name passed in is
* copied and the copy passed in may be reused after this call.
*/
struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
{
struct dentry *dentry = __d_alloc(parent->d_sb, name);
if (!dentry)
return NULL;
spin_lock(&parent->d_lock);
/*
* don't need child lock because it is not subject
* to concurrency here
*/
__dget_dlock(parent);
dentry->d_parent = parent;
list_add(&dentry->d_child, &parent->d_subdirs);
spin_unlock(&parent->d_lock);
return dentry;
}
EXPORT_SYMBOL(d_alloc);
The spinlock_t is initialized when the dentry is allocated:
/**
* __d_alloc - allocate a dcache entry
* @sb: filesystem it will belong to
* @name: qstr of the name
*
* Allocates a dentry. It returns %NULL if there is insufficient memory
* available. On a success the dentry is returned. The name passed in is
* copied and the copy passed in may be reused after this call.
*/
static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
{
struct dentry *dentry;
dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
......
dentry->d_lockref.count = 1;
spin_lock_init(&dentry->d_lock);
seqcount_spinlock_init(&dentry->d_seq, &dentry->d_lock);
......
}
Usage example 2:
/**
* __d_lookup - search for a dentry (racy)
* @parent: parent dentry
* @name: qstr of name we wish to find
* Returns: dentry, or NULL
*
* __d_lookup is like d_lookup, however it may (rarely) return a
* false-negative result due to unrelated rename activity.
*
* __d_lookup is slightly faster by avoiding rename_lock read seqlock,
* however it must be used carefully, eg. with a following d_lookup in
* the case of failure.
*
* __d_lookup callers must be commented.
*/
struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
{
unsigned int hash = name->hash;
struct hlist_bl_head *b = d_hash(hash);
struct hlist_bl_node *node;
struct dentry *found = NULL;
struct dentry *dentry;
/*
* Note: There is significant duplication with __d_lookup_rcu which is
* required to prevent single threaded performance regressions
* especially on architectures where smp_rmb (in seqcounts) are costly.
* Keep the two functions in sync.
*/
/*
* The hash list is protected using RCU.
*
* Take d_lock when comparing a candidate dentry, to avoid races
* with d_move().
*
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
* renames using rename_lock seqlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
rcu_read_lock();
hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
if (dentry->d_name.hash != hash)
continue;
spin_lock(&dentry->d_lock);
if (dentry->d_parent != parent)
goto next;
if (d_unhashed(dentry))
goto next;
if (!d_same_name(dentry, parent, name))
goto next;
dentry->d_lockref.count++;
found = dentry;
spin_unlock(&dentry->d_lock);
break;
next:
spin_unlock(&dentry->d_lock);
}
rcu_read_unlock();
return found;
}
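This function searches the parent directory entry for a child dentry with the given name. Each candidate is compared while its d_lock is held.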
3.3 struct inode
struct inode {
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
};
Usage example 1:
/**
* d_find_alias - grab a hashed alias of inode
* @inode: inode in question
*
* If inode has a hashed alias, or is a directory and has any alias,
* acquire the reference to alias and return it. Otherwise return NULL.
* Notice that if inode is a directory there can be only one alias and
* it can be unhashed only if it has no children, or if it is the root
* of a filesystem, or if the directory was renamed and d_revalidate
* was the first vfs operation to notice.
*
* If the inode has an IS_ROOT, DCACHE_DISCONNECTED alias, then prefer
* any other hashed alias over that one.
*/
struct dentry *d_find_alias(struct inode *inode)
{
struct dentry *de = NULL;
if (!hlist_empty(&inode->i_dentry)) {
spin_lock(&inode->i_lock);
de = __d_find_alias(inode);
spin_unlock(&inode->i_lock);
}
return de;
}
EXPORT_SYMBOL(d_find_alias);
Usage example 2:
/*
* When a file is deleted, we have two options:
* - turn this dentry into a negative dentry
* - unhash this dentry and free it.
*
* Usually, we want to just turn this into
* a negative dentry, but if anybody else is
* currently using the dentry or the inode
* we can't do that and we fall back on removing
* it from the hash queues and waiting for
* it to be deleted later when it has no users
*/
/**
* d_delete - delete a dentry
* @dentry: The dentry to delete
*
* Turn the dentry into a negative dentry if possible, otherwise
* remove it from the hash queues so it can be deleted later
*/
void d_delete(struct dentry * dentry)
{
struct inode *inode = dentry->d_inode;
spin_lock(&inode->i_lock);
spin_lock(&dentry->d_lock);
/*
* Are we the only user?
*/
if (dentry->d_lockref.count == 1) {
dentry->d_flags &= ~DCACHE_CANT_MOUNT;
dentry_unlink_inode(dentry); /* releases d_lock and i_lock itself */
} else {
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
}
}
EXPORT_SYMBOL(d_delete);
Usage example 3:
static struct inode *alloc_inode(struct super_block *sb)
{
const struct super_operations *ops = sb->s_op;
struct inode *inode;
if (ops->alloc_inode)
inode = ops->alloc_inode(sb);
else
inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);
......
inode_init_always(sb, inode);
......
return inode;
}
/**
* inode_init_always - perform inode structure initialisation
* @sb: superblock inode belongs to
* @inode: inode to initialise
*
* These are initializations that need to be done on every inode
* allocation as the fields are not initialised by slab allocation.
*/
int inode_init_always(struct super_block *sb, struct inode *inode)
{
......
spin_lock_init(&inode->i_lock);
......
return 0;
}
Usage example 4:
/**
* do_inode_permission - UNIX permission checking
* @mnt_userns: user namespace of the mount the inode was found from
* @inode: inode to check permissions on
* @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC ...)
*
* We _really_ want to just do "generic_permission()" without
* even looking at the inode->i_op values. So we keep a cache
* flag in inode->i_opflags, that says "this has not special
* permission function, use the fast case".
*/
static inline int do_inode_permission(struct user_namespace *mnt_userns,
struct inode *inode, int mask)
{
if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
if (likely(inode->i_op->permission))
return inode->i_op->permission(mnt_userns, inode, mask);
/* This gets set once for the inode lifetime */
spin_lock(&inode->i_lock);
inode->i_opflags |= IOP_FASTPERM;
spin_unlock(&inode->i_lock);
}
return generic_permission(mnt_userns, inode, mask);
}
3.4 struct super_block
struct super_block {
......
/* s_inode_list_lock protects s_inodes */
spinlock_t s_inode_list_lock ____cacheline_aligned_in_smp;
struct list_head s_inodes; /* all inodes */
spinlock_t s_inode_wblist_lock;
struct list_head s_inodes_wb; /* writeback inodes */
};
Initialization of spinlock_t s_inode_list_lock/s_inode_wblist_lock:
/**
* alloc_super - create new superblock
* @type: filesystem type superblock should belong to
* @flags: the mount flags
* @user_ns: User namespace for the super_block
*
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
static struct super_block *alloc_super(struct file_system_type *type, int flags,
struct user_namespace *user_ns)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
......
INIT_LIST_HEAD(&s->s_inodes);
spin_lock_init(&s->s_inode_list_lock);
INIT_LIST_HEAD(&s->s_inodes_wb);
spin_lock_init(&s->s_inode_wblist_lock);
......
}
Use of s_inode_list_lock:
/**
* inode_sb_list_add - add inode to the superblock list of inodes
* @inode: inode to add
*/
void inode_sb_list_add(struct inode *inode)
{
spin_lock(&inode->i_sb->s_inode_list_lock);
list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
spin_unlock(&inode->i_sb->s_inode_list_lock);
}
EXPORT_SYMBOL_GPL(inode_sb_list_add);
static inline void inode_sb_list_del(struct inode *inode)
{
if (!list_empty(&inode->i_sb_list)) {
spin_lock(&inode->i_sb->s_inode_list_lock);
list_del_init(&inode->i_sb_list);
spin_unlock(&inode->i_sb->s_inode_list_lock);
}
}
/**
* invalidate_inodes - attempt to free all inodes on a superblock
* @sb: superblock to operate on
* @kill_dirty: flag to guide handling of dirty inodes
*
* Attempts to free all inodes for a given superblock. If there were any
* busy inodes return a non-zero value, else zero.
* If @kill_dirty is set, discard dirty inodes too, otherwise treat
* them as busy.
*/
int invalidate_inodes(struct super_block *sb, bool kill_dirty)
{
int busy = 0;
struct inode *inode, *next;
LIST_HEAD(dispose);
again:
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
spin_lock(&inode->i_lock);
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
spin_unlock(&inode->i_lock);
continue;
}
if (inode->i_state & I_DIRTY_ALL && !kill_dirty) {
spin_unlock(&inode->i_lock);
busy = 1;
continue;
}
if (atomic_read(&inode->i_count)) {
spin_unlock(&inode->i_lock);
busy = 1;
continue;
}
inode->i_state |= I_FREEING;
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
if (need_resched()) {
spin_unlock(&sb->s_inode_list_lock);
cond_resched();
dispose_list(&dispose);
goto again;
}
}
spin_unlock(&sb->s_inode_list_lock);
dispose_list(&dispose);
return busy;
}
/**
* fsnotify_unmount_inodes - an sb is unmounting. handle any watched inodes.
* @sb: superblock being unmounted.
*
* Called during unmount with no locks held, so needs to be safe against
* concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
*/
static void fsnotify_unmount_inodes(struct super_block *sb)
{
struct inode *inode, *iput_inode = NULL;
spin_lock(&sb->s_inode_list_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
/*
* We cannot __iget() an inode in state I_FREEING,
* I_WILL_FREE, or I_NEW which is fine because by that point
* the inode cannot have any associated watches.
*/
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
spin_unlock(&inode->i_lock);
continue;
}
/*
* If i_count is zero, the inode cannot have any watches and
* doing an __iget/iput with SB_ACTIVE clear would actually
* evict all inodes with zero i_count from icache which is
* unnecessarily violent and may in fact be illegal to do.
* However, we should have been called /after/ evict_inodes
* removed all zero refcount inodes, in any case. Test to
* be sure.
*/
if (!atomic_read(&inode->i_count)) {
spin_unlock(&inode->i_lock);
continue;
}
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inode_list_lock);
if (iput_inode)
iput(iput_inode);
/* for each watch, send FS_UNMOUNT and then remove it */
fsnotify_inode(inode, FS_UNMOUNT);
fsnotify_inode_delete(inode);
iput_inode = inode;
cond_resched();
spin_lock(&sb->s_inode_list_lock);
}
spin_unlock(&sb->s_inode_list_lock);
if (iput_inode)
iput(iput_inode);
}
Use of s_inode_wblist_lock:
// linux-5.15/fs/fs-writeback.c
/*
* mark an inode as under writeback on the sb
*/
void sb_mark_inode_writeback(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
unsigned long flags;
if (list_empty(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
if (list_empty(&inode->i_wb_list)) {
list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
trace_sb_mark_inode_writeback(inode);
}
spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
}
}
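This function marks the given inode as under writeback by adding it to the superblock's writeback list. It first checks, without taking the lock, whether the inode's i_wb_list entry is empty (the inode is not on the list). If so, it takes the superblock's writeback-list lock with the interrupt state saved and interrupts disabled, checks again under the lock to avoid adding the inode twice, and then adds the inode to the list and emits a trace event.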
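The check, lock, recheck pattern used here can be sketched in user space. The names below (list_lock, on_list, list_additions, mark_writeback_model) are hypothetical stand-ins: an atomic flag plays the role of s_inode_wblist_lock and a boolean plays the role of the list membership test.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag list_lock = ATOMIC_FLAG_INIT;  /* stands in for s_inode_wblist_lock */
static atomic_bool on_list;                       /* stands in for !list_empty(&inode->i_wb_list) */
static int list_additions;                        /* how many times we really added */

static void mark_writeback_model(void)
{
    if (atomic_load(&on_list))
        return;                         /* fast path: already listed, no lock taken */

    while (atomic_flag_test_and_set(&list_lock))
        ;                               /* spin until the lock is ours */
    if (!atomic_load(&on_list)) {       /* recheck: the state may have changed */
        atomic_store(&on_list, true);
        list_additions++;
    }
    atomic_flag_clear(&list_lock);

    assert(atomic_load(&on_list));      /* either way the entry is listed now */
}
```

Calling mark_writeback_model() twice adds the entry only once. The unlocked first check keeps the common case cheap, and the second check under the lock is what makes the result correct when two callers race.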
// linux-5.15/fs/fs-writeback.c
/*
* clear an inode as under writeback on the sb
*/
void sb_clear_inode_writeback(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
unsigned long flags;
if (!list_empty(&inode->i_wb_list)) {
spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
if (!list_empty(&inode->i_wb_list)) {
list_del_init(&inode->i_wb_list);
trace_sb_clear_inode_writeback(inode);
}
spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
}
}
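This function removes the given inode from the superblock's writeback list, indicating that it is no longer under writeback. It first checks without the lock whether the inode's i_wb_list entry is non-empty; if so, it takes the writeback-list lock, rechecks under the lock, and then deletes the inode from the list and emits a trace event.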
// linux-5.15/fs/fs-writeback.c
/*
* The @s_sync_lock is used to serialise concurrent sync operations
* to avoid lock contention problems with concurrent wait_sb_inodes() calls.
* Concurrent callers will block on the s_sync_lock rather than doing contending
* walks. The queueing maintains sync(2) required behaviour as all the IO that
* has been issued up to the time this function is enter is guaranteed to be
* completed by the time we have gained the lock and waited for all IO that is
* in progress regardless of the order callers are granted the lock.
*/
static void wait_sb_inodes(struct super_block *sb)
{
LIST_HEAD(sync_list);
/*
* We need to be protected against the filesystem going from
* r/o to r/w or vice versa.
*/
WARN_ON(!rwsem_is_locked(&sb->s_umount));
mutex_lock(&sb->s_sync_lock);
/*
* Splice the writeback list onto a temporary list to avoid waiting on
* inodes that have started writeback after this point.
*
* Use rcu_read_lock() to keep the inodes around until we have a
* reference. s_inode_wblist_lock protects sb->s_inodes_wb as well as
* the local list because inodes can be dropped from either by writeback
* completion.
*/
rcu_read_lock();
spin_lock_irq(&sb->s_inode_wblist_lock);
list_splice_init(&sb->s_inodes_wb, &sync_list);
/*
* Data integrity sync. Must wait for all pages under writeback, because
* there may have been pages dirtied before our sync call, but which had
* writeout started before we write it out. In which case, the inode
* may not be on the dirty list, but we still have to wait for that
* writeout.
*/
while (!list_empty(&sync_list)) {
struct inode *inode = list_first_entry(&sync_list, struct inode,
i_wb_list);
struct address_space *mapping = inode->i_mapping;
/*
* Move each inode back to the wb list before we drop the lock
* to preserve consistency between i_wb_list and the mapping
* writeback tag. Writeback completion is responsible to remove
* the inode from either list once the writeback tag is cleared.
*/
list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);
/*
* The mapping can appear untagged while still on-list since we
* do not have the mapping lock. Skip it here, wb completion
* will remove it.
*/
if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
continue;
spin_unlock_irq(&sb->s_inode_wblist_lock);
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
spin_unlock(&inode->i_lock);
spin_lock_irq(&sb->s_inode_wblist_lock);
continue;
}
__iget(inode);
spin_unlock(&inode->i_lock);
rcu_read_unlock();
/*
* We keep the error status of individual mapping so that
* applications can catch the writeback error using fsync(2).
* See filemap_fdatawait_keep_errors() for details.
*/
filemap_fdatawait_keep_errors(mapping);
cond_resched();
iput(inode);
rcu_read_lock();
spin_lock_irq(&sb->s_inode_wblist_lock);
}
spin_unlock_irq(&sb->s_inode_wblist_lock);
rcu_read_unlock();
mutex_unlock(&sb->s_sync_lock);
}
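This function waits for writeback to complete on all inodes of the given superblock. It walks the superblock's writeback list and waits for every page under writeback to be written out, which ensures that all I/O issued before the call has completed when the function returns. Note that s_inode_wblist_lock is dropped before filemap_fdatawait_keep_errors() is called, because that call can sleep and a spinlock must not be held across it.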
References
Linux 2.6.20
Linux 5.15.0
https://medium.com/geekculture/the-linux-kernel-locking-api-and-shared-objects-1169c2ae88ff