Tap virtual NIC
1 Overview
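Tap devices are typically used in virtualization scenarios. The driver code lives in drivers/net/tun.c, and tap and tun share most of it.
Note: drivers/net/tap.c is not the tap device code; it implements macvtap and ipvtap.
In the rest of this text we refer to both simply as tap. See the tap device architecture in the figure below:
The figure marks the key functions and the data flow. A tap device has two parts: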
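- NIC function: faces upward to the kernel protocol stack and corresponds to the driver data structure tun_struct;
- Data interface: faces downward to the virtual NIC back end and corresponds to the driver data structure tun_file. It offers two interfaces:
  - file, for user space; the kernel handlers are tun_chr_read_iter() and tun_chr_write_iter();
  - socket, for in-kernel users, mainly vhost, as shown in the figure above;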
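In the figure above, which part is the virtual NIC?
- virtio-net + (qemu-vhost) + tap
  - virtio-net is the front end of the virtual NIC in the Guest;
  - qemu is the control plane, vhost is the data plane;
  - the tap device is the back end of the virtual NIC;
- tap + (qemu-vhost) + virtio-net
  - tap is the front end of the virtual NIC on the Host;
  - qemu is the control plane, vhost is the data plane;
  - virtio-net is the back end of the virtual NIC;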
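The tap device is a virtual NIC in its own right and, at the same time, the back end of the Guest's virtual NIC:
- as the front end of the Host virtual NIC (it speaks for itself);
- as the back end of virtio-net + (qemu-vhost) (it cleans up after someone else);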
2 tun_file
2.1 Creation
A tun_file is created when we open /dev/net/tun; see the code:
tun_chr_open()
---
tfile = (struct tun_file *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
&tun_proto, 0);
...
if (ptr_ring_init(&tfile->tx_ring, 0, GFP_KERNEL)) {
sk_free(&tfile->sk);
return -ENOMEM;
}
...
tfile->socket.file = file;
tfile->socket.ops = &tun_socket_ops;
sock_init_data(&tfile->socket, &tfile->sk);
...
file->private_data = tfile;
...
---
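We get an fd that corresponds to a tun_file, and that tun_file also embeds a socket. However, we cannot call sendmsg/recvmsg directly on this fd, because it represents a char device. To reach the socket inside the tun_file, kernel code has to go through a dedicated interface: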
get_socket()
-> get_tap_socket()
-> tun_get_socket()
---
if (file->f_op != &tun_fops)
return ERR_PTR(-EINVAL);
tfile = file->private_data;
if (!tfile)
return ERR_PTR(-EBADFD);
return &tfile->socket;
---
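The char-device restriction described above can be observed from user space. The following is a minimal sketch, not taken from the driver: the helper names fd_is_chardev() and try_sendmsg() are hypothetical, and /dev/null is used as a stand-in for the tap fd so that the check runs without privileges. The assumption is that any non-socket fd is rejected by sendmsg() with ENOTSOCK in the same way.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd is a character device, 0 if not, -1 on error. */
static int fd_is_chardev(int fd)
{
    struct stat st;

    assert(fd >= 0); /* caller must pass an open descriptor */
    if (fstat(fd, &st) < 0)
        return -1;
    return S_ISCHR(st.st_mode) ? 1 : 0;
}

/* Hypothetical helper: try a one-byte sendmsg(); return 0 or the errno. */
static int try_sendmsg(int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg;

    assert(fd >= 0);
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    /* The socket syscalls refuse descriptors that are not sockets. */
    if (sendmsg(fd, &msg, MSG_DONTWAIT) < 0)
        return errno;
    return 0;
}
```

On a char device fd, try_sendmsg() is expected to report ENOTSOCK, which is why user space falls back to read()/write() while vhost obtains the embedded socket through tun_get_socket().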
2.2 Function
For a tap virtual NIC, each tun_file is one of its channels, or queues.
In tun_net_xmit(), the selection of a tun_file based on queue_mapping is clearly visible:
tun_net_xmit()
---
int txq = skb->queue_mapping;
...
tfile = rcu_dereference(tun->tfiles[txq]);
...
if (ptr_ring_produce(&tfile->tx_ring, skb))
goto drop;
...
---
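The queue selection above can be modelled in user space. This is a simplified sketch, not the kernel implementation: every toy_* name and both size constants are hypothetical, there is no locking or RCU, and the ring only imitates the ptr_ring convention that a NULL slot means "empty".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_RING_SIZE  4 /* hypothetical ring size */
#define TOY_MAX_QUEUES 2 /* hypothetical queue count */

struct toy_ring {
    void *slot[TOY_RING_SIZE]; /* NULL marks an empty slot */
    int producer;
    int consumer;
};

struct toy_tfile {
    struct toy_ring tx_ring;
};

struct toy_tun {
    struct toy_tfile *tfiles[TOY_MAX_QUEUES];
    unsigned long tx_dropped;
};

/* Enqueue ptr; fails with -ENOSPC when the producer slot is still occupied. */
static int toy_ring_produce(struct toy_ring *r, void *ptr)
{
    assert(ptr != NULL); /* NULL is reserved for "empty" */
    if (r->slot[r->producer])
        return -ENOSPC;
    r->slot[r->producer] = ptr;
    r->producer = (r->producer + 1) % TOY_RING_SIZE;
    return 0;
}

/* Dequeue the oldest entry, or return NULL when the ring is empty. */
static void *toy_ring_consume(struct toy_ring *r)
{
    void *ptr = r->slot[r->consumer];

    if (!ptr)
        return NULL;
    r->slot[r->consumer] = NULL;
    r->consumer = (r->consumer + 1) % TOY_RING_SIZE;
    return ptr;
}

/* Same shape as the excerpt: pick the queue by index, enqueue or drop. */
static int toy_xmit(struct toy_tun *tun, int txq, void *pkt)
{
    struct toy_tfile *tfile;

    if (txq < 0 || txq >= TOY_MAX_QUEUES)
        goto drop;
    tfile = tun->tfiles[txq];
    if (!tfile || toy_ring_produce(&tfile->tx_ring, pkt))
        goto drop;
    return 0;
drop:
    tun->tx_dropped++;
    return -1;
}
```

A packet sent with txq 1 lands only in the second queue's ring, and once a ring is full further packets for that queue are counted as drops until a consumer frees a slot.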
After a tun_file is created, the first TUNSETIFF ioctl creates a tap device; a tun_file can also be attached to a tap device that already exists:
TUNSETIFF 1st time,
create a net_device and attach current tun_file on it
------------------------------------------------------------------
tun_set_iff()
---
dev = alloc_netdev_mqs(sizeof(struct tun_struct), name,
NET_NAME_UNKNOWN, tun_setup, queues,
queues);
...
err = tun_attach(tun, file, false, ifr->ifr_flags & IFF_NAPI,
ifr->ifr_flags & IFF_NAPI_FRAGS, false);
...
err = register_netdevice(tun->dev);
...
strcpy(ifr->ifr_name, tun->dev->name);
...
// This name will be copied to userland
---
TUNSETIFF 2nd time,
attach another tun_file on this tun net_device
-------------------------------------------------------------------
tun_set_iff()
---
dev = __dev_get_by_name(net, ifr->ifr_name);
if (dev) {
...
err = tun_attach(tun, file, ifr->ifr_flags & IFF_NOFILTER,
ifr->ifr_flags & IFF_NAPI,
ifr->ifr_flags & IFF_NAPI_FRAGS, true);
...
}
---
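From user space, both TUNSETIFF cases above are driven by the same ioctl. The sketch below is an illustration under assumptions: tap_fill_ifreq() and tap_open_queue() are hypothetical helper names, the caller is assumed to hold CAP_NET_ADMIN, and the flags follow the multi-queue example in the kernel's tuntap documentation, where IFF_MULTI_QUEUE is set on every queue of the device.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Hypothetical helper: prepare the TUNSETIFF request for a tap queue. */
static void tap_fill_ifreq(struct ifreq *ifr, const char *name, int multi_queue)
{
    assert(ifr != NULL);
    memset(ifr, 0, sizeof(*ifr));
    /* IFF_TAP: Ethernet frames; IFF_NO_PI: no packet-info header. */
    ifr->ifr_flags = IFF_TAP | IFF_NO_PI;
    if (multi_queue)
        ifr->ifr_flags |= IFF_MULTI_QUEUE;
    /* An empty name lets the kernel choose one. */
    if (name)
        strncpy(ifr->ifr_name, name, IFNAMSIZ - 1);
}

/*
 * Hypothetical helper: open one queue of a tap device.
 * name_inout must point to IFNAMSIZ bytes; on success it holds the
 * interface name the kernel copied back. Returns the fd, or -1.
 */
static int tap_open_queue(char *name_inout, int multi_queue)
{
    struct ifreq ifr;
    int fd;

    assert(name_inout != NULL);
    fd = open("/dev/net/tun", O_RDWR); /* allocates a tun_file */
    if (fd < 0)
        return -1;

    tap_fill_ifreq(&ifr, name_inout, multi_queue);
    /* Creates the device, or attaches to an existing one of that name. */
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    memcpy(name_inout, ifr.ifr_name, IFNAMSIZ);
    return fd;
}
```

Calling tap_open_queue() twice with the same name buffer and multi_queue set would be expected to yield two fds, that is two tun_file queues, on one net_device.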
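One end of a tun_file faces the Host networking stack; the other end faces the back end of the Tap virtual NIC through either the file or the socket. As an skb channel, it mainly provides two functions: buffering and event notification.
- When an skb is sent from the Host protocol stack into the Tap device: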
tun_net_xmit()
---
	if (ptr_ring_produce(&tfile->tx_ring, skb))
		goto drop;

	/* NETIF_F_LLTX requires to do our own update of trans_start */
	queue = netdev_get_tx_queue(dev, txq);
	queue->trans_start = jiffies;

	/* Notify and wake up reader process */
	if (tfile->flags & TUN_FASYNC)
		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
---
// sk_data_ready is sock_def_readable(), installed by sock_init_data()

=====================SYNC==========================
tun_recvmsg() / tun_chr_read_iter()
  -> tun_do_read()
    -> tun_ring_recv()
---
	ptr = ptr_ring_consume(&tfile->tx_ring);
	if (ptr)
		goto out;
	if (noblock) {
		error = -EAGAIN;
		goto out;
	}

	add_wait_queue(&tfile->socket.wq.wait, &wait);

	while (1) {
		set_current_state(TASK_INTERRUPTIBLE);
		ptr = ptr_ring_consume(&tfile->tx_ring);
		if (ptr)
			break;
		...
		schedule();
	}

	__set_current_state(TASK_RUNNING);
	remove_wait_queue(&tfile->socket.wq.wait, &wait);
---
=====================ASYNC==========================
vhost_net_enable_vq()
---
	sock = vhost_vq_get_backend(vq);
	if (!sock)
		return 0;

	return vhost_poll_start(poll, sock->file);
---
tun_chr_poll()
---
	sk = tfile->socket.sk;
	poll_wait(file, sk_sleep(sk), wait);
	...
---
vhost_poll_init()
---
	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
---
sk_sleep() returns the wait queue held in sk->sk_wq. sock_def_readable() wakes up that wait queue, which invokes vhost_poll_wakeup(); that function in turn queues a vhost work item, which runs handle_rx.
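The ptr ring in the tun_file buffers skbs, and the notification is raised through the socket's sk_data_ready(). A consumer can wait for the event in two ways, synchronously or asynchronously; see the code snippets above.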
- When an skb is sent from the Tap device to the Host protocol stack, the code is fairly simple:
tun_sendmsg() / tun_chr_write_iter() -> tun_get_user() -> tun_rx_batched() -> netif_receive_skb()
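
The write path above is what a user-space write() on the tap fd triggers. The following is a hedged sketch rather than a reference implementation: toy_build_eth_frame() and tap_inject() are hypothetical helpers, and it assumes the fd was configured with IFF_TAP | IFF_NO_PI, so that each write() carries exactly one raw Ethernet frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_ETH_HDR_LEN 14 /* dst MAC (6) + src MAC (6) + EtherType (2) */

/*
 * Hypothetical helper: lay out an Ethernet frame in buf.
 * Returns the frame length, or 0 if buf is too small.
 */
static size_t toy_build_eth_frame(uint8_t *buf, size_t cap,
                                  const uint8_t dst[6], const uint8_t src[6],
                                  uint16_t ethertype,
                                  const void *payload, size_t len)
{
    assert(buf != NULL && dst != NULL && src != NULL);
    if (cap < TOY_ETH_HDR_LEN + len)
        return 0;

    memcpy(buf, dst, 6);
    memcpy(buf + 6, src, 6);
    buf[12] = (uint8_t)(ethertype >> 8);   /* EtherType is big-endian */
    buf[13] = (uint8_t)(ethertype & 0xff);
    if (len)
        memcpy(buf + TOY_ETH_HDR_LEN, payload, len);
    return TOY_ETH_HDR_LEN + len;
}

/*
 * Hypothetical helper: hand one frame to the tap device. In the kernel
 * this write() is served by tun_chr_write_iter(), which passes the frame
 * on toward the Host protocol stack.
 */
static ssize_t tap_inject(int tap_fd, const uint8_t *frame, size_t len)
{
    assert(tap_fd >= 0 && frame != NULL);
    return write(tap_fd, frame, len);
}
```

vhost uses the socket flavour of the same path, tun_sendmsg(), so both entry points converge on tun_get_user().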