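SamOutV2 0.18B model release
- Adopts em (embedding) parameter sharing, which raises the hidden dimension from 1024 to 1536 while halving the parameter count.
- SFT single-turn dialogue loss holds at 1.8.
- If matching state-based inference code is added in the future, performance should stay the same while inference over arbitrary lengths uses a constant amount of memory.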
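As a rough illustration of why state-based inference could run in constant memory: a cumulative maximum, the core operation of the MaxState module in the code on this page, can be computed chunk by chunk while carrying only the latest running maximum. The sketch below is an assumption about how such a state might work, not the released inference code; the function name streaming_cummax and the state layout are hypothetical.

```python
import torch


def streaming_cummax(chunks):
    """Process a sequence chunk by chunk, carrying only a running maximum."""
    state = None  # hypothetical state: running max with shape [batch, 1, dim]
    outputs = []
    for chunk in chunks:  # each chunk has shape [batch, chunk_len, dim]
        local = torch.cummax(chunk, dim=1)[0]
        if state is not None:
            # Fold in everything seen before this chunk (broadcast over chunk_len).
            local = torch.maximum(local, state)
        state = local[:, -1:, :]  # only the last position needs to be kept
        outputs.append(local)
    return torch.cat(outputs, dim=1), state


torch.manual_seed(0)
x = torch.randn(2, 12, 8)
full = torch.cummax(x, dim=1)[0]                      # whole sequence at once
streamed, state = streaming_cummax(torch.split(x, 4, dim=1))  # three chunks of 4
print(torch.equal(full, streamed), state.shape)
```

The state size depends only on the batch and feature dimensions, not on how many tokens have been processed.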
import torch


class MaxState(torch.nn.Module):
    def __init__(self, hidden_dim, heads):
        super(MaxState, self).__init__()
        assert hidden_dim % heads == 0, "Hidden size must be divisible by the number of heads."
        self.head_size = hidden_dim // heads
        self.head0 = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.head1 = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.head2 = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.head_num = heads
        self.hidden = hidden_dim

    def forward(self, input_data, state=None):
        b, s, k, h = input_data.shape[0], input_data.shape[1], self.head_num, self.head_size
        out = self.head0(input_data)
        out1 = self.head1(input_data)
        out2 = self.head2(input_data)
        # Split into heads: [b, s, k * h] -> [b, k, s, h]
        out = out.reshape([b, s, k, h]).permute([0, 2, 1, 3])
        out1 = out1.reshape([b, s, k, h]).permute([0, 2, 1, 3])
        # Running maximum along the sequence axis, scaled by sqrt(head_size)
        out = torch.cummax((out + out1) / h ** 0.5, 2)[0]
        # Merge heads back: [b, k, s, h] -> [b, s, k * h]
        out = out.permute([0, 2, 1, 3])
        out1 = out1.permute([0, 2, 1, 3])
        out = out.reshape([b, s, -1])
        out1 = out1.reshape([b, s, -1])
        out = (out + out2) * out + out1
        # state is passed through unchanged in this version
        return out, state


class FeedForward(torch.nn.Module):
    def __init__(self, hidden_size):
        super(FeedForward, self).__init__()
        self.ffn1 = torch.nn.Linear(hidden_size, hidden_size // 2)
        self.ffn2 = torch.nn.Linear(hidden_size // 2, hidden_size)
        self.gate = torch.nn.Linear(hidden_size, hidden_size // 2)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x1 = self.ffn1(x)
        x2 = self.relu(self.gate(x))
        xx = x1 * x2
        # Project the gated activations back to hidden_size
        x = self.ffn2(xx)
        return x


class DecoderLayer(torch.nn.Module):
    def __init__(self, hidden_size, num_heads):
        super(DecoderLayer, self).__init__()
        self.self_attention = MaxState(hidden_size, num_heads)
        self.ffn = FeedForward(hidden_size)
        self.layer_norm = torch.nn.LayerNorm(hidden_size)
        self.alpha = torch.nn.Parameter(torch.tensor(0.5))

    def forward(self, x, state=None):
        x1, state = self.self_attention(x, state)
        # Learnable blend of the ffn branch and the layer input
        x = self.layer_norm(self.alpha * self.ffn(x1) + (1 - self.alpha) * x)
        return x, state


class SamOut(torch.nn.Module):
    def __init__(self, voc_size, hidden_size, num_heads, num_layers):
        super(SamOut, self).__init__()
        self.em = torch.nn.Embedding(voc_size, hidden_size, padding_idx=3)
        self.decoder_layers = torch.nn.ModuleList([DecoderLayer(hidden_size, num_heads) for _ in range(num_layers)])
        self.head = FeedForward(hidden_size)

    def state_forward(self, state, x):
        if state is None:
            state = [None] * len(self.decoder_layers)
        for i, decoder_layer in enumerate(self.decoder_layers):
            x1, state[i] = decoder_layer(x, state[i])
            x = x1 + x
        return x, state

    def forward(self, x, state=None):
        x = self.em(x)
        x, state = self.state_forward(state, x)
        # Shared embedding: the output matrix is derived from em.weight
        em = self.head(self.em.weight) / x.shape[-1]
        return x @ em.permute([1, 0]), state


if __name__ == '__main__':
    net = SamOut(235, 256, 16, 4)
    net(torch.randint(0, 200, [2, 8 * 13]))
The code above defines a model named SamOut. Its components are described below.
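MaxState class
A multi-head module that takes the input and processes it through three linear layers (head0, head1, head2). Its main job is to compute a cumulative maximum along the sequence axis, after dividing by the scaling factor h ** 0.5. The output size of each head is hidden_dim / heads, that is, head_size.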
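To make the cumulative maximum concrete, here is a minimal sketch of that single step in isolation. The tensor sizes are arbitrary choices for illustration, and the linear projections are left out. It also checks a property that follows from using a running maximum: changing a later token cannot change the output at earlier positions.

```python
import torch

torch.manual_seed(0)
b, s, k, h = 1, 5, 2, 4        # batch, sequence length, heads, head size
x = torch.randn(b, k, s, h)    # layout after the permute in MaxState

# Running maximum along the sequence axis (dim=2), scaled as in MaxState.
y = torch.cummax(x / h ** 0.5, dim=2)[0]

# Perturb only the last token and recompute.
x_changed = x.clone()
x_changed[:, :, -1, :] += 10.0
y_changed = torch.cummax(x_changed / h ** 0.5, dim=2)[0]

# Earlier positions are unaffected by the change to the last token.
print(torch.equal(y[:, :, :-1], y_changed[:, :, :-1]))
```

Because each position only sees the maximum over itself and earlier positions, no separate causal mask is needed for this step.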
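FeedForward class
A feed-forward network with two linear layers (ffn1 and ffn2) and a gating branch (gate). The gate output goes through a ReLU and is multiplied elementwise with the ffn1 output at half the hidden size, then ffn2 projects the result back to the full hidden size. The code shown here contains no dropout layer. Modules of this kind are a common part of Transformer architectures, where they add expressive capacity.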
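The gating step can be traced with a few standalone layers. This is a sketch with an arbitrary hidden size of 8, written with free-standing layers rather than the FeedForward class itself, to show the shapes at each step and how a zero gate value switches a channel off.

```python
import torch

torch.manual_seed(0)
hidden = 8
ffn1 = torch.nn.Linear(hidden, hidden // 2)
gate = torch.nn.Linear(hidden, hidden // 2)
ffn2 = torch.nn.Linear(hidden // 2, hidden)

x = torch.randn(2, 3, hidden)
g = torch.relu(gate(x))   # gate values are >= 0
mid = ffn1(x) * g         # [2, 3, hidden // 2]; channels with g == 0 are zeroed
out = ffn2(mid)           # back to [2, 3, hidden]
print(mid.shape, out.shape)
```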
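DecoderLayer class
Contains a MaxState instance in the self-attention position and a FeedForward network. It also applies a weighted residual connection and layer normalization. The learnable alpha parameter, initialized to 0.5, sets the balance between the ffn output (computed on the self_attention output) and the layer input x before normalization.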
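The blend controlled by alpha can be written out on its own. In this sketch the tensor named branch is a random stand-in for the ffn output, and the sizes are arbitrary; only the mixing and normalization follow the DecoderLayer code.

```python
import torch

torch.manual_seed(0)
hidden = 8
norm = torch.nn.LayerNorm(hidden)
alpha = torch.nn.Parameter(torch.tensor(0.5))  # learnable scalar, starts at 0.5

x = torch.randn(2, 3, hidden)        # layer input
branch = torch.randn(2, 3, hidden)   # stand-in for ffn(self_attention(x))

mixed = alpha * branch + (1 - alpha) * x   # at alpha = 0.5 this is a plain average
out = norm(mixed)                          # normalized over the last dimension
print(out.shape, alpha.requires_grad)
```

During training the optimizer can move alpha away from 0.5, shifting weight toward either the branch or the input.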
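SamOut class
The main body of the decoder model, which ties together all the components above. It takes the vocabulary size (voc_size), hidden size (hidden_size), number of heads (num_heads), and number of layers (num_layers) as parameters. It has an embedding layer (em) that maps input tokens into vector space, and a module list (decoder_layers) made up of DecoderLayer instances. Finally, it has a FeedForward layer (head) used for the final output transformation.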
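Forward method
- The forward method first converts the input token sequence into vectors through the embedding layer.
- These vectors are then passed through the stack of DecoderLayer instances. Each layer receives and returns its own entry in the state list (unchanged in the code shown).
- Finally, head transforms the embedding weights em.weight, and the hidden states are multiplied by the transpose of the result to produce one score per vocabulary entry.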
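The last step, where the output matrix is derived from the embedding weights, can be sketched in isolation. The sizes are arbitrary, the decoder stack is omitted, and a single Linear layer stands in for the FeedForward head; these are assumptions made to keep the example short.

```python
import torch

torch.manual_seed(0)
voc_size, hidden = 50, 16
em = torch.nn.Embedding(voc_size, hidden)
head = torch.nn.Linear(hidden, hidden)    # stand-in for the FeedForward head

tokens = torch.randint(0, voc_size, [2, 7])
x = em(tokens)                            # [2, 7, hidden]; decoder stack omitted
out_matrix = head(em.weight) / hidden     # [voc_size, hidden], built from em.weight
logits = x @ out_matrix.permute([1, 0])   # [2, 7, voc_size]

# A separate, untied output matrix would add voc_size * hidden parameters.
tied_params = sum(p.numel() for p in list(em.parameters()) + list(head.parameters()))
untied_params = tied_params + voc_size * hidden
print(logits.shape, untied_params - tied_params)
```

The saving grows with the vocabulary size, which is how sharing the embedding can free parameter budget for a larger hidden dimension.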
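The main block creates a SamOut instance net (vocabulary size 235, hidden size 256, 16 heads, 4 layers) and feeds it random integers to test the model. The call torch.randint(0, 200, [2, 8 * 13]) produces a tensor of shape [2, 104] whose elements are random integers from 0 to 199 (the upper bound 200 is exclusive). It represents a batch of 2 input sequences, each of length 104.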
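A quick check of the test input, as a small sketch that uses only the call from the main block plus a fixed seed added here for repeatability:

```python
import torch

torch.manual_seed(0)
tokens = torch.randint(0, 200, [2, 8 * 13])

# Shape is [batch, sequence]; values fall in [0, 200), so every id is a valid
# index into an embedding table with 235 rows.
print(tokens.shape, tokens.dtype)
```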
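Note that the model structure shown here is highly customized and is not a standard Transformer architecture. In particular, MaxState uses a cumulative maximum along the sequence where a standard Transformer would use attention.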