当前位置：首页 > article >正文

C/C++语言基础--C++字符串实现的三种方法(eager copy、cow、sso)

article 2024/11/28 8:39:46

本专栏目的

更新C/C++的基础语法，包括C++的一些新特性

前言

在C语言中，字符串是通过字符数组进行存储，但是在C++中基于char *类型接着进行了封装，实现，这里将讲解string三种实现方式；
C语言后面也会继续更新知识点，如内联汇编；
欢迎收藏 + 关注，本人将会持续更新。

文章目录

- - string实现方式
  - eager copy
  - - 解释
    - 难点分析
    - 实现结果
    - 实现代码
  - cow
  - - 解释
    - 实现讲解
    - 实现结果
    - 实现代码
  - sso
  - - 解释
    - 实现讲解
    - 实现代码
  - 参考资料

string实现方式

C++中有三种金典方式可以实现string类型，这里的三种方式是指存储字符串的策略，分别是：

eager copy
COW（Copy-On-Write）
SSO（Short-String-Optimization）

string中比较重要的三个字段：

char *data：指向存放字符串的首地址（在 SSO 的某些实现方案中可能没有此字段）
size_t size：字符串长度
size_t capacity：字符串容量

eager copy

解释

这个是最简单、最好理解的一种，在每次拷贝时将原 string 对应的内存以及所持有的动态资源完整地复制一份，即没有任何特殊处理。

📘 参考知乎大佬的博客：

在这里插入图片描述

优点：

实现简单。
每个对象互相独立，不用考虑那么多乱七八糟的场景。

缺点：

字符串较大时，拷贝浪费空间，浪费时间。

难点分析

👀 功能实现设置：
1、构造、析构函数
2、数值拷贝、拷贝、移动拷贝
3、赋值、移动赋值
4、“万金油函数”
5、输出
6、修改每一个字符

难点：

这里主要以实现简单为主，比较难的是内存扩容设计.

这里规定：

扩容策略，参考最大内存规定：INT_MAX
小于 32，则内容增加16，一倍扩容
大于32，则按照一般扩容，但是不允许超过最大的容量
如果扩容的内存没有达到需求的数量count，则扩容到count，这个时候扩容数量大于 INT_MAX

以实现拷贝为例：

// 准备
/*扩容策略，参考最大内存规定：INT_MAX
* 1、小于 32，则内容增加16，一倍扩容
* 2、大于32， 则按照一般扩容，但是不允许超过最大的容量
* 3、如果扩容的内存没有达到需求的数量count，则扩容到count，这个时候扩容数量大于 INT_MAX
*/
// 扩容策略
size_t grow_to(size_t count, size_t capacity, size_t maxsize)
{
	if (capacity < 32) {
		capacity += 16;
	}
	else {
		size_t temp = capacity + capacity / 2;
		if (temp > maxsize) {
			capacity = maxsize;
		}
		else {
			capacity = temp;
		}
	}

	if (capacity < count) {
		capacity = count;
	}

	return capacity;
}
// 辅助函数
const size_t maxSize() const 
{
	return INT_MAX;
}
//2、数值拷贝、拷贝、移动拷贝
EC_string(const char* data) 
{
	m_size = strlen(data);
	if (m_capacity < m_size + 1) {   // 注意 +1 是换行符
		m_capacity = grow_to(m_size + 1, m_capacity, maxSize());
	}
	m_data = new char[m_capacity];
	strcpy_s(m_data, m_size + 1, data);
}
EC_string(const EC_string& other)   
{
	if (m_capacity < other.m_size + 1) {
		m_capacity = grow_to(m_size + 1, m_capacity, maxSize());
	}

	// 利用上面函数
	EC_string temp(other.m_data);  // 拷贝一份
	std::swap(m_data, temp.m_data);
	std::swap(m_size, temp.m_size);
	std::swap(m_capacity, temp.m_size);
}
EC_string(EC_string&& other) noexcept
{
	m_data = other.m_data;
	m_capacity = other.m_capacity;
	m_size = other.m_size;
	// 资源转移，后回归原来参数
	other.m_data = nullptr;
	other.m_size = 0;
	other.m_capacity = 16;
}

实现结果

在这里插入图片描述

实现代码

#include <iostream>
#include <cstring>
#include <algorithm>
#include <cassert>
using namespace std;

/*
1、构造、析构函数
2、数值拷贝、浅拷贝、移动拷贝
3、赋值、移动赋值
4、“万金油函数”
5、输出
6、修改每一个字符
*/

// 扩容策略是重点

// 简单实现
class EC_string
{
public:
	//1、构造、析构函数
	EC_string() {};
	// 析构函数
	~EC_string() {
		if (m_data != nullptr) {
			delete m_data;
		}
	}

	// 准备
	/*扩容策略，参考最大内存规定：INT_MAX
	* 1、小于 32，则内容增加16，一倍扩容
	* 2、大于32， 则按照一般扩容，但是不允许超过最大的容量
	* 3、如果扩容的内存没有达到需求的数量count，则扩容到count，这个时候扩容数量大于 INT_MAX
	*/
	// 扩容策略
	size_t grow_to(size_t count, size_t capacity, size_t maxsize)
	{
		if (capacity < 32) {
			capacity += 16;
		}
		else {
			size_t temp = capacity + capacity / 2;
			if (temp > maxsize) {
				capacity = maxsize;
			}
			else {
				capacity = temp;
			}
		}

		if (capacity < count) {
			capacity = count;
		}

		return capacity;
	}
	// 辅助函数
	const size_t maxSize() const 
	{
		return INT_MAX;
	}
	//2、数值拷贝、浅拷贝、移动拷贝
	EC_string(const char* data) 
	{
		m_size = strlen(data);
		if (m_capacity < m_size + 1) {   // 注意 +1 是换行符
			m_capacity = grow_to(m_size + 1, m_capacity, maxSize());
		}
		m_data = new char[m_capacity];
		strcpy_s(m_data, m_size + 1, data);
	}
	EC_string(const EC_string& other)   
	{
		if (m_capacity < other.m_size + 1) {
			m_capacity = grow_to(m_size + 1, m_capacity, maxSize());
		}

		// 利用上面函数
		EC_string temp(other.m_data);  // 拷贝一份
		std::swap(m_data, temp.m_data);
		std::swap(m_size, temp.m_size);
		std::swap(m_capacity, temp.m_size);
	}
	EC_string(EC_string&& other) noexcept
	{
		m_data = other.m_data;
		m_capacity = other.m_capacity;
		m_size = other.m_size;
		// 资源转移，后回归原来参数
		other.m_data = nullptr;
		other.m_size = 0;
		other.m_capacity = 16;
	}

	//3、赋值、移动赋值
	EC_string& operator=(const EC_string& other)
	{
		// 拷贝一份
		m_data = new char[m_capacity];
		strcpy_s(m_data, m_size + 1, other.m_data);
		m_capacity = other.m_capacity;
		m_size = other.m_size;
	}
	EC_string& operator=(EC_string&& other) noexcept
	{
		m_data = other.m_data;
		m_capacity = other.m_capacity;
		m_size = other.m_size;
		// 资源转移，后回归原来参数
		other.m_data = nullptr;
		other.m_size = 0;
		other.m_capacity = 16;
	}

	//4、“万金油函数”
	size_t size() const 
	{
		return m_size;
	}
	bool empty()
	{
		return m_size == 0;
	}

	//5、输出
	friend ostream& operator<<(ostream& out, const EC_string& str)
	{
		if (str.m_data == nullptr) {
			return out;
		}
		for (int i = 0; i < str.m_size; i++) {
			out << str.m_data[i];
		}

		return out;
	}

	//6、修改每一个字符
	char& operator[](const size_t index)
	{
		if (index >= m_size) {
			std::cout << "index out" << std::endl;
			assert(NULL);
		}

		return m_data[index];
	}

private:
	char* m_data{ nullptr };
	size_t m_size{ 0 };
	size_t m_capacity{ 16 };			// 假设初始大小为16
};


int main()
{
	EC_string str1("Welcome to the sheep Pig blog");
	EC_string str2 = str1;

	cout << "str1: ";
	cout << str1 << endl;
	cout << "str2: ";
	cout << str2 << endl;

	// 互补干扰
	str1[0] = 'i';
	str2[0] = 'y';
	
	cout << "fix str1: ";
	cout << str1 << endl;
	cout << "fix str2: ";
	cout << str2 << endl;

	return 0;
}

cow

解释

COW，也称copy-on-write，这个也算是计算机里的基本思想了。不同于 eager copy 的每次拷贝都会复制，此种实现方式为写时复制。只有在某个 string 要对共享对象进行修改时，才会真正执行拷贝。

注意： 由于存在共享机制，所以需要一个std::atomic<size_t>，代表被多少对象共享。

📘 参考知乎大佬的博客：

在这里插入图片描述

这个就是不同对象共享一块字符串内存，但是写时复制：

在这里插入图片描述

优点：

字符串空间较大时，减少了分配、复制字符串的时间。

缺点：

refcount 需要原子操作，性能有损耗。
某些情况下会带来意外的开销，比如非 const 成员使用[]，这会触发 COW，因为无法知晓应用程序是否会对返回的字符做修改，下面解释会讲
其他更细节的缺点可以参考：std::string 的 Copy-on-Write：不如想象中美好

实现讲解

💁‍♂ 这里只是实现了一个简单的模拟，相比于上面eager copy的实现，就是用了再封装思想

⚾️ 准备：将字符串封装在内部类，这样这个字符串就是当做一个内部成员实现。

private:
	class StringRep
	{
	public:
		size_t useCount{ 1 };
		size_t size{0};
		size_t capacity{ 16 };
		char* data{nullptr};
		StringRep(const char* str)
		{
			size = strlen(str);
			capacity = size + 1;
			
			data = new char[capacity];
			strcpy_s(data, capacity, str);
		}
		~StringRep()
		{
			if(data) delete[] data;
		}
		size_t inc() { return ++useCount; }
		size_t dec() { return --useCount; }
		void destroy() { delete this; }
	};

👀 读时共享：这个实现就是，在赋值和拷贝的时候，实现的是浅拷贝，指向同一块内存，对应的是解释第一张图片，这个是实现了读时共享。

// 实现浅拷贝
COW_String(const class COW_String& other)
{
	_rep = other._rep;
	_rep->inc();
}

写时拷贝：这个是在修改的时候重新创建封装的内部类。

拷贝, 这个也是缺点，因为不确定是否改了原来字符串

char& operator[](int index)
{
	if (_rep->useCount > 1)
	{
		_rep->dec();
		_rep = new StringRep(_rep->data);   // 拷贝, 这个也是缺点，因为不确定是否改了
	}
	return _rep->data[index];
}

实现结果

在这里插入图片描述

实现代码

#include <iostream>
#include <cstring>
#include <algorithm>
#include <cassert>
using namespace std;

class COW_String
{
public:
	COW_String(const char* str) :_rep(new StringRep(str)) {}
	COW_String(const class COW_String& other)
	{
		_rep = other._rep;
		_rep->inc();
	}
	~COW_String() {
		if (_rep->dec() == 0)
			_rep->destroy();
	}
	char& operator[](int index)   // 以防万一，拷贝，这个也是缺点，因为不确定是否改了
	{
		if (_rep->useCount > 1)
		{
			_rep->dec();
			_rep = new StringRep(_rep->data);
		}
		return _rep->data[index];
	}

	char operator[](int index)const
	{
		return _rep->data[index];
	}
	friend std::ostream& operator<<(std::ostream& out, const COW_String& other)
	{
		int len = strlen(other._rep->data);
		for (int i = 0; i < len; i++)
		{
			out << other._rep->data[i];
		}
		return out;
	}

	// 获取数据地址
	const char* getData() const {
		return _rep->data;
	}
private:
	class StringRep
	{
	public:
		size_t useCount{ 1 };
		size_t size{ 0 };
		size_t capacity{ 16 };
		char* data{ nullptr };
		StringRep(const char* str)
		{
			size = strlen(str);
			capacity = size + 1;

			data = new char[capacity];
			strcpy_s(data, capacity, str);
		}
		~StringRep()
		{
			if (data) delete[] data;
		}
		size_t inc() { return ++useCount; }
		size_t dec() { return --useCount; }
		void destroy() { delete this; }
	};


private:
	StringRep* _rep;
};

int main()
{
	COW_String str1("Welcome to the sheep Pig blog");
	COW_String str2(str1);	  // 浅拷贝

	printf("%p\n", str1.getData());
	printf("%p\n", str2.getData());

	// 修改
	str1[0] = 'i';
	printf("************************************************\n");
	printf("%p\n", str1.getData());
	printf("%p\n", str2.getData());


	return 0;
}

sso

解释

Small String Optimization. 基于字符串大多数比较短的特点，利用 string 对象本身的栈空间，这个时候字符串存储在栈区，不在堆去，来存储短字符串，而当字符串长度大于某个临界值时，则使用 eager copy 的方式。

👓 数据原理： SSO 下，string 的数据结构会稍微复杂点，使用 union 来区分短字符串和长字符串的场景：

class string {
  char *start;
  size_t size;
  static const int localSize = 15;
  union{
    char buffer[localSize+1];      // 满足条件时，用来存放短字符串
    size_t capacity;
  }data;
};

短字符串，SSO：

在这里插入图片描述

长字符串，eager copy：

在这里插入图片描述

这种数据结构的实现下，SSO 的阈值一般是 15 字节。

优点：

短字符串时，无动态内存分配，这个时候就是用栈空间。

缺点：

string 对象占用空间比 eager copy 和 cow 要大。

实现讲解

这个实现起来就是创建字符串的时候，先判断栈内存能否放得下要存储的字符串，放不下才申请内存。

//是否是短字符串
bool isSmall()const { return _size < localSize; }

SSO_String(const SSO_String& other)
	{
		if (this == &other)
			return;

		if (!isSmall())     // 判断是否是段字符串
		{
			_capacity = other._capacity;
			_size = other._size;
			_str = new char[_capacity];
			strcpy_s(_str, _capacity, other._str);
		}
	}

实现代码

class SSO_String
{
public:
	SSO_String() {}
	SSO_String(const char* str) 
	{
		_size = strlen(str);
		if (_size < localSize)
		{
			_str = _buffer;		
		}
		else if (_size >= localSize)
		{
			_capacity = _size + 1;
			_str = new char[_capacity];
		}
		strcpy_s(_str, capacity(), str);
	}
	SSO_String(const SSO_String& other)
	{
		if (this == &other)
			return;

		if (!isSmall())    // 判断是否是段字符串
		{
			_capacity = other._capacity;
			_size = other._size;
			_str = new char[_capacity];
			strcpy_s(_str, _capacity, other._str);
		}
	}
	~SSO_String()
	{
		if (!isSmall())
		{
			delete[] _str;
		}
	}
	size_t capacity() const
	{
		if (isSmall())
			return localSize;
		else
			return _capacity;
	}
	friend std::ostream& operator<<(std::ostream& out, const SSO_String& other)
	{
		for (int i = 0; i < other._size; i++)
		{
			out << other._str[i];
		}
		return out;
	}
private:
	//是否是短字符串
	bool isSmall()const { return _size < localSize; }
private:
	char* _str{ nullptr };
	size_t _size{ 0 };

	static const size_t localSize = 16;
	union
	{
		size_t _capacity{ 0 };
		char _buffer[localSize];
	};
};