
Mapping between pdb_strand_id, asym_id, and entity_id

article 2025/3/11 15:54:17

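In an mmCIF file, pdb_strand_id, asym_id, and entity_id are three key identifiers that describe a macromolecular structure at different levels. The example code below reads a gzip-compressed mmCIF file (.cif.gz), extracts the entity_poly and pdbx_poly_seq_scheme categories, and builds three dictionaries: chain ID to asym ID, asym ID to entity ID, and entity ID to canonical amino acid sequence.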

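- pdb_strand_id is closest to the traditional chain identifier of the PDB file format (chain A, chain B, and so on). It is the author-assigned chain ID. In pdbx_poly_seq_scheme it appears as pdb_strand_id; entity_poly lists the same IDs, comma-separated, in pdbx_strand_id.
- asym_id is the mmCIF-native identifier for one specific chain in the asymmetric unit (label_asym_id in atom_site). It is more strictly defined than the chain ID, which matters most when a structure contains several copies of the same molecule.
- In many cases pdb_strand_id and asym_id are identical, especially for simple structures. For more complex structures they can differ, so it is safer to build an explicit mapping than to assume they match.
- entity_id identifies a distinct molecular entity in the structure as a whole, and is different from the chain-level identifiers (pdb_strand_id and asym_id). It names a unique molecule, not a particular physical copy of it.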
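To make the three levels concrete, here is a minimal sketch that uses invented per-residue rows in the style of pdbx_poly_seq_scheme. The chain names "X", "Y", and "P" and all other values are hypothetical and do not come from a real entry; the point is only to show how one entity can own several asym IDs while each asym ID pairs with one chain ID.

```python
# Hypothetical rows in the style of pdbx_poly_seq_scheme, one per residue:
# (asym_id, entity_id, pdb_strand_id). All values are invented for illustration.
rows = [
    ("A", "1", "X"), ("A", "1", "X"),  # first copy of entity 1
    ("B", "1", "Y"), ("B", "1", "Y"),  # second copy of entity 1
    ("C", "2", "P"),                   # a different molecule, entity 2
]

strand_to_asym = {}
asym_to_entity = {}
for asym_id, entity_id, strand_id in rows:
    # Repeated rows for the same chain simply overwrite with the same value.
    strand_to_asym[strand_id] = asym_id
    asym_to_entity[asym_id] = entity_id

print(strand_to_asym)
print(asym_to_entity)
```

In this toy case the chain IDs and asym IDs differ for every chain, and asym IDs "A" and "B" both point to entity "1".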

Code:

from mmcif.io.PdbxReader import PdbxReader
import gzip

# Path to the mmCIF file
cif_file_path = '/path/to/your/file.cif.gz'

data = []
# Open the gzipped mmCIF file in text mode and parse it
with gzip.open(cif_file_path, 'rt') as cif:
    reader = PdbxReader(cif)
    reader.read(data)

# Take the first data block
data = data[0]

# Get the pdbx_poly_seq_scheme category
pdbx_poly_seq_scheme = data.getObj('pdbx_poly_seq_scheme')

# Get the entity_poly category
entity_poly = data.getObj('entity_poly')

# Print the entity ID, polymer type, strand IDs, and canonical sequence of every entity_poly row
for row in entity_poly.getRowList():
    entity_id = row[entity_poly.getIndex('entity_id')]
    polymer_type = row[entity_poly.getIndex('type')]
    strand_id = row[entity_poly.getIndex('pdbx_strand_id')]
    sequence = row[entity_poly.getIndex('pdbx_seq_one_letter_code_can')]

    print(f'Entity ID: {entity_id}, Type: {polymer_type}, Strand ID: {strand_id}, Sequence: {sequence}')

# Map pdb_strand_id to asym_id (one entry per chain; per-residue rows collapse)
strand_idx = pdbx_poly_seq_scheme.getIndex('pdb_strand_id')
asym_idx = pdbx_poly_seq_scheme.getIndex('asym_id')
pdb2asym = {r[strand_idx]: r[asym_idx]
            for r in pdbx_poly_seq_scheme.getRowList()}

# Map asym_id to entity_id, for polypeptide(L) entities only
chs2num = {pdb2asym[ch]: r[entity_poly.getIndex('entity_id')]
           for r in entity_poly.getRowList()
           for ch in r[entity_poly.getIndex('pdbx_strand_id')].split(',')
           if r[entity_poly.getIndex('type')] == 'polypeptide(L)'}

# Map entity_id to the canonical one-letter sequence of each polypeptide(L) entity,
# with the line breaks inside the sequence removed
num2seq = {r[entity_poly.getIndex('entity_id')]: r[entity_poly.getIndex('pdbx_seq_one_letter_code_can')].replace('\n', '')
           for r in entity_poly.getRowList()
           if r[entity_poly.getIndex('type')] == 'polypeptide(L)'}


print(pdb2asym)
print(chs2num)
print(num2seq)
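The three dictionaries can be chained to look up the sequence of a chain directly. The following is a minimal sketch, not part of the original code: sequence_for_chain is a hypothetical helper, and the example dictionaries hold invented values shaped like pdb2asym, chs2num, and num2seq.

```python
def sequence_for_chain(chain_id, pdb2asym, chs2num, num2seq):
    """Return the canonical sequence for a chain ID, or None if any lookup fails."""
    asym_id = pdb2asym.get(chain_id)   # chain ID -> asym ID
    entity_id = chs2num.get(asym_id)   # asym ID -> entity ID
    return num2seq.get(entity_id)      # entity ID -> sequence

# Invented example values, shaped like the dictionaries built above.
example_pdb2asym = {"X": "A", "Y": "B"}
example_chs2num = {"A": "1", "B": "1"}
example_num2seq = {"1": "MKTAYIAK"}

print(sequence_for_chain("Y", example_pdb2asym, example_chs2num, example_num2seq))
print(sequence_for_chain("Z", example_pdb2asym, example_chs2num, example_num2seq))
```

Using dict.get at each step means an unknown chain ID, or a chain that is not a polypeptide(L), yields None instead of raising KeyError.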

Identical sequences share the same entity_id

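When a molecule forms a complex, the same chain may appear several times. These chains belong to the same molecule (entity), but each one has its own chain_id. In a homodimer, for example, both subunits have the same protein sequence: they are labeled with different chain_id values but share a single entity_id.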
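One way to find such repeated chains is to invert an asym_id to entity_id dictionary like chs2num and group the chains by entity. This is a minimal sketch with invented values; the names asym_to_entity, entity_to_asym, and multi_copy are hypothetical.

```python
from collections import defaultdict

# Invented asym_id -> entity_id mapping, shaped like chs2num.
asym_to_entity = {"A": "1", "B": "1", "C": "2"}

# Group asym IDs by the entity they belong to.
entity_to_asym = defaultdict(list)
for asym_id, entity_id in asym_to_entity.items():
    entity_to_asym[entity_id].append(asym_id)
entity_to_asym = dict(entity_to_asym)

# Entities present in more than one copy, such as the subunits of a homodimer.
multi_copy = {e: chains for e, chains in entity_to_asym.items() if len(chains) > 1}

print(entity_to_asym)
print(multi_copy)
```

Here entity "1" is present as two chains, "A" and "B", while entity "2" occurs once.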


http://www.kler.cn/a/323411.html