PostgreSQL技术内幕5:PostgreSQL存储引擎从磁盘到内存的读取
文章目录
- 0.简介
- 1.背景知识
- 1.1 计算机存储结构
- 1.2 数据库常见的磁盘和内存访问形式
- 2. 整体获取层次
- 3.元组介绍
- 4. Buffer管理
- 4.1 Buffer组成
- 4.2 修改后落盘
- 4.3 获取buffer页的流程
- 5.存储管理器(SMGR)
- 6.磁盘管理器(MD)
- 7.虚拟文件管理器(VFD)
- 8.物理文件存储介绍
- 9. 总结
0.简介
本篇内容介绍PG从磁盘到内存的加载流程,经过那些层级,各层级作用以及源码分析,主要包括共享缓存(Buffer),存储管理器,磁盘管理器,虚拟文件管理器以及部分物理文件介绍。
1.背景知识
1.1 计算机存储结构
计算机存储层级如下图,速度层级越往下越慢。
1.2 数据库常见的磁盘和内存访问形式
常见的访问方式为在磁盘以page形式存储,在内存中存储到Buffer Pool中,如下图:
2. 整体获取层次
PG整体获取一个元组涉及的层级如图,下面将对每一层进行详细说明:
3.元组介绍
Tuple结构如下
Header信息:
struct HeapTupleHeaderData
{
union
{
HeapTupleFields t_heap;
DatumTupleFields t_datum;
} t_choice;
ItemPointerData t_ctid; /* current TID of this or newer tuple (or a
* speculative insertion token) */
/* Fields below here must match MinimalTupleData! */
#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK2 2
uint16 t_infomask2; /* number of attributes + various flags */
#define FIELDNO_HEAPTUPLEHEADERDATA_INFOMASK 3
uint16 t_infomask; /* various flag bits, see below */
#define FIELDNO_HEAPTUPLEHEADERDATA_HOFF 4
uint8 t_hoff; /* sizeof header incl. bitmap, padding */
/* ^ - 23 bytes - ^ */
#define FIELDNO_HEAPTUPLEHEADERDATA_BITS 5
bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */
/* MORE DATA FOLLOWS AT END OF STRUCT */
};
4. Buffer管理
4.1 Buffer组成
PG中buffer 由三部分组成,如下图,创建函数可见CreateSharedMemoryAndSemaphores。
1)Buffer table layer:一个hash表,记录buffer描述符和buff pool的映射信息(即为tag->bufferid(buffer pool的下标)的映射)
2)Buffer descriptors layer:buffer描述符
3)Buffer pool:真正的buffer数据
Buffer descriptors结构如下:
typedef struct BufferDesc
{
BufferTag tag; /* ID of page contained in buffer */
int buf_id; /* buffer's index number (from 0) */
/* state of the tag, containing flags, refcount and usagecount */
pg_atomic_uint32 state;
int wait_backend_pid; /* backend PID of pin-count waiter */
int freeNext; /* link in freelist chain */
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
//对应到页
typedef struct buftag
{
RelFileNode rnode; /* physical relation identifier */
ForkNumber forkNum; /*file type*/
BlockNumber blockNum; /* blknum relative to begin of reln */
} BufferTag;
//唯一确定表
typedef struct RelFileNode
{
Oid spcNode; /* tablespace */
Oid dbNode; /* database */
Oid relNode; /* relation */
} RelFileNode;
4.2 修改后落盘
可见MarkBufferDirty函数,其主要作用就是增加该页的BM_DIRTY状态,该状态的页面淘汰前会落盘。
4.3 获取buffer页的流程
可见ReadBuffer_common函数
4.4 Buffer的淘汰策略
PG中使用的使用ClockSweep淘汰策略,其相关结构如下
typedef struct
{
/* Spinlock: protects the values below */
slock_t buffer_strategy_lock;
/*
* Clock sweep hand: index of next buffer to consider grabbing. Note that
* this isn't a concrete buffer - we only ever increase the value. So, to
* get an actual buffer, it needs to be used modulo NBuffers.
*/
pg_atomic_uint32 nextVictimBuffer;
int firstFreeBuffer; /* Head of list of unused buffers */
int lastFreeBuffer; /* Tail of list of unused buffers */
/*
* NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
* when the list is empty)
*/
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
*/
uint32 completePasses; /* Complete cycles of the clock sweep */
pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
* Bgworker process to be notified upon activity or -1 if none. See
* StrategyNotifyBgWriter.
*/
int bgwprocno;
} BufferStrategyControl;
buffer descriptor是循环链表,数组标识usage_count,nextVictimBuffer是32位usigned int,总是指向某个buffer descriptor并按顺时针顺序旋转。淘汰页面代码:StrategyGetBuffer。主要做到事情就是引用数为0,使用数减一,使用为0可以淘汰。
5.存储管理器(SMGR)
负责统一管理不同存储介质的对接,通过虚函数表来调用。每个backend使用SMGR的SMgrRelationHash管理SMgrRelationData,实现加速访问,SMgrRelationData记录一张打开的表的信息。
typedef struct SMgrRelationData
{
/* rnode is the hashtable lookup key, so it must be first! */
RelFileNodeBackend smgr_rnode; /* relation physical identifier */
/* pointer to owning pointer, or NULL if none */
struct SMgrRelationData **smgr_owner;
/*
* These next three fields are not actually used or manipulated by smgr,
* except that they are reset to InvalidBlockNumber upon a cache flush
* event (in particular, upon truncation of the relation). Higher levels
* store cached state here so that it will be reset when truncation
* happens. In all three cases, InvalidBlockNumber means "unknown".
*/
BlockNumber smgr_targblock; /* current insertion target block */
BlockNumber smgr_fsm_nblocks; /* last known size of fsm fork */
BlockNumber smgr_vm_nblocks; /* last known size of vm fork */
/* additional public fields may someday exist here */
/*
* Fields below here are intended to be private to smgr.c and its
* submodules. Do not touch them from elsewhere.
*/
int smgr_which; /* storage manager selector */
/*
* for md.c; per-fork arrays of the number of open segments
* (md_num_open_segs) and the segments themselves (md_seg_fds).
*/
int md_num_open_segs[MAX_FORKNUM + 1];
struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
/* if unowned, list link in list of all unowned SMgrRelations */
dlist_node node;
} SMgrRelationData;
外部调用流程如下:
1)外部buffer pool调用smgrread获取数据
2)smgr调用smgrrsw去获取
3)smgrsw调用底层MD接口,具体接口使用函数表定义
static const f_smgr smgrsw[] = {
/* magnetic disk */
{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
mdprefetch, mdread, mdwrite, mdwriteback, mdnblocks, mdtruncate,
mdimmedsync, mdpreckpt, mdsync, mdpostckpt
}
};
具体代码文件:smgr.c
6.磁盘管理器(MD)
Md是berkeyley开源的磁盘管理器,可以访问磁盘和ssd,可以看md.c文件。负责磁盘文件的打开,创建,删除。
7.虚拟文件管理器(VFD)
虚拟文件管理器的作用是为了防止句柄数量超过操作系统的限制,如果超过就需要进行淘汰。采用的是LRU的淘汰机制,可以看fd.c。
8.物理文件存储介绍
在这里和MySQL InnoDB做个比较,InnoDB采用的是段页式管理,而PG采用的只有8k分页,代码上来说更为简洁。
MySQL InnoDB:
PG是按8k分页,可以看到其实是将变与不变部分进行了分离,item为大小不变部分(也就是单个记录的元数据),tuple是大小可变部分。
//Header信息如下:
typedef struct PageHeaderData
{
/* XXX LSN is member of *any* block, not only page-organized ones */
PageXLogRecPtr pd_lsn; /* LSN: next byte after last byte of xlog
* record for last change to this page */
uint16 pd_checksum; /* checksum */
uint16 pd_flags; /* flag bits, see below */
LocationIndex pd_lower; /* offset to start of free space */
LocationIndex pd_upper; /* offset to end of free space */
LocationIndex pd_special; /* offset to start of special space */
uint16 pd_pagesize_version;
TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;
typedef PageHeaderData *PageHeader;
//ItemData信息如下:
typedef struct ItemIdData
{
unsigned lp_off:15, /* offset to tuple (from start of page) */
lp_flags:2, /* state of item pointer, see below */
lp_len:15; /* byte length of tuple */
} ItemIdData;
typedef ItemIdData *ItemId;
9. 总结
本篇介绍了元组获取的五个层级,从发出获取元组的请求,依次经过Buffer(真正的页数据建立起的映射),存储管理器(SMGR,支持扩展对接不同的存储介质,目前只有磁盘),磁盘管理器(MD,真正访问磁盘,read,write操作),虚拟文件管理器(VFD,用以防止文件打开数目超过限制),最终获得磁盘的页。