1. TiDB Operator Backup to MinIO
Create the MinIO S3 storage
- Initialize MinIO
minio server $HOME/operator/data --console-address :9090
- Set the region to shanghai (a sketch of the commands follows below)
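How the region is configured depends on the MinIO release; a minimal sketch (the mc alias myminio is an assumption, and the credentials reuse the ones appearing in the unit test later in this post):
export MINIO_REGION_NAME=shanghai   # older releases; newer releases use MINIO_SITE_REGION
minio server $HOME/operator/data --console-address :9090
mc alias set myminio http://192.168.1.2:9000 tidb Jianxin123
mc mb myminio/tidbuss               # create the bucket referenced by the backup CR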
Create the TiDB Operator backup CR
1. The backup CR configuration file backup-s3.yaml
apiVersion: pingcap.com/v1alpha1
kind: Backup
metadata:
name: backup2s3-dev
namespace: tidb-admin
labels:
user: paul
spec:
## Describes the compute resource requirements and limits of Backup.
## Ref: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 500m
memory: 512Mi
## List of environment variables to set in the container, like v1.Container.Env.
## Note that the following builtin env vars will be overwritten by values set here.
## - S3_PROVIDER
## - S3_ENDPOINT
## - AWS_REGION
## - AWS_ACL
## - AWS_STORAGE_CLASS
## - AWS_DEFAULT_REGION
## - AWS_ACCESS_KEY_ID
## - AWS_SECRET_ACCESS_KEY
## - GCS_PROJECT_ID
## - GCS_OBJECT_ACL
## From is the TidbCluster to be backed up.
## It takes higher precedence than the cluster specified in `br`. If `from` is not set, the cluster in `br` will be backed up.
#from:
## Host is the address of the TidbCluster to be backed up, which is the service name of the TidbCluster, such as `basic-tidb`.
#host: alpha-tidb
## Port is the port of the TidbCluster to be backed up.
#port: 4000
## User is the accessing user of the TidbCluster to be backed up.
#user: root
## SecretName is the secret that contains the password of the accessing user of the TidbCluster to be backed up.
# secretName: sh.helm.release.v1.tidb-operator.v1
## TLSClientSecretName is the name of secret which stores tidb server client certificate.
## Defaults to nil.
# tlsClientSecretName: ""
backupType: full
backupMode: snapshot
## TikvGCLifeTime specifies the safe gc life time for Backup.
## The time limit during which data is retained for each GC, in the format of Go Duration.
## When a GC happens, the current time minus this value is the safe point.
## Defaults to 72h.
tikvGCLifeTime: 72h
s3:
provider: aws
secretName: minio-secret
bucket: tidbuss
prefix: tidb/s3
endpoint: http://192.168.1.2:9000
## StorageSize is the PV size specified for the backup operation.
## This value must be greater than the size of the TidbCluster to be backed up.
## Defaults to 100Gi.
storageSize: "100Gi"
## BR configuration.
## Ref: https://docs.pingcap.com/tidb/stable/backup-and-restore-tool
br:
## Cluster specifies name of TidbCluster to be backed up.
cluster: "alpha"
## Namespace specifies namespace of TidbCluster to be backed up.
clusterNamespace: "tidb-admin"
## LogLevel is the log level. Defaults to `info`.
# logLevel: "info"
## StatusAddr is the HTTP listening address for the status report service. Defaults to empty.
# statusAddr: ""
## Concurrency is the size of thread pool on each node that execute the backup task.
## Defaults to 4.
concurrency: 4
## RateLimit is the rate limit of the backup task, MB/s per node.
## If set to 4, the speed limit is 4 MB/s. The speed limit is not set by default.
# rateLimit: 0
## TimeAgo presents back up the data before `timeAgo`, e.g. 1m, 1h. Defaults to empty.
# timeAgo: 1m
## Checksum specifies whether to verify the files after the backup is completed.
## Defaults to `true`.
# checksum: true
## CheckRequirements specifies whether to check requirements before backup
# checkRequirements: true
## SendCredToTikv specifies whether the BR process passes its AWS or GCP privileges to the TiKV process.
## Defaults to `true`.
sendCredToTikv: true
## OnLine specifies whether online during restore. Defaults to false.
# onLine: false
## Options specifies the extra arguments that BR supports. These options have the highest priority.
# options: []
## ToolImage specifies the tool image used in `Backup`, which supports BR and Dumpling images.
## For examples `spec.toolImage: pingcap/br:v5.2.0` or `spec.toolImage: pingcap/dumpling:v5.2.0`
## For BR image, if it does not contain tag, Pod will use image 'ToolImage:${TiKV_Version}'.
toolImage: pingcap/br:v6.5.5
## ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images.
## If private registry is used, imagePullSecrets may be set.
## You can also set this in service account.
## Ref: https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod
# imagePullSecrets:
# - name: secretName
## TableFilter specifies tables that match the table filter rules for BR or Dumpling.
## Ref: https://docs.pingcap.com/tidb/stable/table-filter
## Defaults to empty.
# tableFilter: []
## Affinity for Backup pod scheduling
## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
# affinity: {}
## UseKMS to decrypt the secrets. Defaults to false.
useKMS: false
## ServiceAccount Specify service account of Backup.
serviceAccount: "tidb-backup-manager"
## CleanPolicy specifies whether to clean backup data when the Backup CR is deleted, if not set, the backup data will be retained.
## `Retain` represents that the backup data will be retained when the Backup CR is deleted.
## `OnFailure` represents that the backup data will be cleaned only for the failed backups when the Backup CR is deleted.
## `Delete` represents that the backup data will be cleaned when the Backup CR is deleted.
cleanPolicy: Retain
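The CR above references secretName: minio-secret, which must exist in the same namespace before the backup is applied. A minimal sketch of creating it (access_key and secret_key are the key names tidb-operator expects for S3-compatible storage; the values reuse the MinIO credentials from the unit test later in this post):
kubectl -n tidb-admin create secret generic minio-secret \
  --from-literal=access_key=tidb \
  --from-literal=secret_key=Jianxin123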
- Apply the CR to create the backup
kubectl -n tidb-admin apply -f backup-s3.yaml
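If the Backup does not reach the Complete status, the backup job pod's log is the first place to look. A sketch of fetching it (the pod name suffix is generated, so read it from kubectl get pod first; <pod-suffix> below is a placeholder):
kubectl -n tidb-admin get pod | grep backup2s3-dev
kubectl -n tidb-admin logs backup-backup2s3-dev-<pod-suffix>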
Backup errors and troubleshooting
- The error log is as follows:
E1105 10:41:10.821663 8 manager.go:408] Get backup metadata for backup files in s3://tidbuss/tidb/s3 of cluster tidb-admin/backup2s3-dev failed, err: read backup meta from bucket tidbuss and prefix tidb/s3: blob (key "backupmeta") (code=Unknown): BadRequest: Bad Request
status code: 400, request id: 1794B3FFD6488588, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
I1105 10:41:10.841852 8 backup_status_updater.go:128] Backup: [tidb-admin/backup2s3-dev] updated successfully
error: read backup meta from bucket tidbuss and prefix tidb/s3: blob (key "backupmeta") (code=Unknown): BadRequest: Bad Request
status code: 400, request id: 1794B3FFD6488588, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e
- Troubleshooting the tidb-operator BR backup to MinIO S3
By analyzing the tidb-operator source code, the issue was traced to the util.GetBRMetaData method in the file cmd/backup-manager/app/util/util.go.
// Write a unit test case to reproduce the issue
func TestGetBRMetaData(t *testing.T) {
ctx := context.Background()
os.Setenv("AWS_ACCESS_KEY_ID", "tidb")
os.Setenv("AWS_SECRET_ACCESS_KEY", "Jianxin123")
provider := v1alpha1.StorageProvider{
S3: &v1alpha1.S3StorageProvider{
Provider: "aws",
Bucket: "tidbuss",
Prefix: "tidb/s3",
Endpoint: "http://192.168.1.2:9000",
SecretName: "minio-secrete",
},
}
_, err := GetBRMetaData(ctx, provider)
log.Fatalln(err)
}
// Modify the original method and add debug output to surface the root cause
// GetBRMetaData get backup metadata from cloud storage
func GetBRMetaData(ctx context.Context, provider v1alpha1.StorageProvider) (*kvbackup.BackupMeta, error) {
s, err := util.NewStorageBackend(provider, &util.StorageCredential{})
if err != nil {
return nil, err
}
defer s.Close()
var metaData []byte
// use exponential backoff, every retry duration is duration * factor ^ (used_step - 1)
backoff := wait.Backoff{
Duration: time.Second,
Steps: 6,
Factor: 2.0,
Cap: time.Minute,
}
fmt.Println("bucket", s.GetBucket())
// _, err = s.Attributes(ctx, "backupmeta")
obj, err := s.List(&blob.ListOptions{Prefix: "tidb/s3"}).Next(ctx)
fmt.Println("xx bucket", err, obj)
readBackupMeta := func() error {
exist, err := s.Exists(ctx, "backupmeta")
if err != nil {
return err
}
fmt.Println("IS existed", exist)
if !exist {
return fmt.Errorf("%s not exist", constants.MetaFile)
}
metaData, err = s.ReadAll(ctx, constants.MetaFile)
if err != nil {
return err
}
return nil
}
fmt.Println("xxxx", readBackupMeta())
isRetry := func(err error) bool {
return !strings.Contains(err.Error(), "not exist")
}
err = retry.OnError(backoff, isRetry, readBackupMeta)
if err != nil {
return nil, errors.Annotatef(err, "read backup meta from bucket %s and prefix %s", s.GetBucket(), s.GetPrefix())
}
backupMeta := &kvbackup.BackupMeta{}
err = proto.Unmarshal(metaData, backupMeta)
if err != nil {
return nil, errors.Annotatef(err, "unmarshal backup meta from bucket %s and prefix %s", s.GetBucket(), s.GetPrefix())
}
return backupMeta, nil
}
In the modified method, the bucket's Exists and Attributes calls did not yield any useful diagnostic information, so the List method was used instead to list the backupmeta file stored in MinIO S3; the resulting error message points directly to the region setting.
go test -timeout 30s -run ^TestGetBRMetaData$
bucket tidbuss
xx bucket blob (code=Unknown): AuthorizationHeaderMalformed: The authorization header is malformed; the region is wrong; expecting 'shanghai'.
status code: 400, request id: 1794B4674A2D5A48, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8 <nil>
xxxx blob (key "backupmeta") (code=Unknown): BadRequest: Bad Request
status code: 400, request id: 1794B4674B073F88, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
The data in the test case is identical to the s3 configuration in backup-s3.yaml, so the reproduced failure points to the region configuration item.
- Add the region configuration to the unit test
func TestGetBRMetaData(t *testing.T) {
ctx := context.Background()
os.Setenv("AWS_ACCESS_KEY_ID", "tidb")
os.Setenv("AWS_SECRET_ACCESS_KEY", "Jianxin123")
provider := v1alpha1.StorageProvider{
S3: &v1alpha1.S3StorageProvider{
Provider: "aws",
Region: "shanghai",
Bucket: "tidbuss",
Prefix: "tidb/s3",
Endpoint: "http://192.168.1.2:9000",
SecretName: "minio-secrete",
},
}
_, err := GetBRMetaData(ctx, provider)
log.Fatalln(err)
}
Re-running the test case shows that the backup metadata is now read successfully, which means region is a required field in the s3 section of the configuration file and must match the region configured in MinIO.
hbu@Pauls-MacBook-Air util % go test -timeout 30s -run ^TestGetBRMetaData$
bucket tidbuss
xx bucket EOF <nil>
IS existed true
xxxx <nil>
IS existed true
2023/11/05 18:52:00 <nil>
exit status 1
FAIL github.com/pingcap/tidb-operator/cmd/backup-manager/app/util 0.442s
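Note that the FAIL above does not mean GetBRMetaData failed: the printed err is <nil>, and the non-zero exit comes from log.Fatalln, which always terminates the process. For a throwaway test like this, an idiomatic assertion would be something along these lines:
// replaces the last two lines of TestGetBRMetaData
if _, err := GetBRMetaData(ctx, provider); err != nil {
    t.Fatalf("GetBRMetaData failed: %v", err)
}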
Fixing the problem
- Modify the tidb-operator backup-to-S3 configuration file backup-s3.yaml and add a region field to its s3 section.
apiVersion: pingcap.com/v1alpha1
kind: Backup
metadata:
name: backup2s3-dev
namespace: tidb-admin
labels:
user: paul
spec:
## Describes the compute resource requirements and limits of Backup.
## Ref: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 500m
memory: 512Mi
## List of environment variables to set in the container, like v1.Container.Env.
## Note that the following builtin env vars will be overwritten by values set here.
## - S3_PROVIDER
## - S3_ENDPOINT
## - AWS_REGION
## - AWS_ACL
## - AWS_STORAGE_CLASS
## - AWS_DEFAULT_REGION
## - AWS_ACCESS_KEY_ID
## - AWS_SECRET_ACCESS_KEY
## - GCS_PROJECT_ID
## - GCS_OBJECT_ACL
## From is the TidbCluster to be backed up.
## It takes higher precedence than the cluster specified in `br`. If `from` is not set, the cluster in `br` will be backed up.
#from:
## Host is the address of the TidbCluster to be backed up, which is the service name of the TidbCluster, such as `basic-tidb`.
#host: alpha-tidb
## Port is the port of the TidbCluster to be backed up.
#port: 4000
## User is the accessing user of the TidbCluster to be backed up.
#user: root
## SecretName is the secret that contains the password of the accessing user of the TidbCluster to be backed up.
# secretName: sh.helm.release.v1.tidb-operator.v1
## TLSClientSecretName is the name of secret which stores tidb server client certificate.
## Defaults to nil.
# tlsClientSecretName: ""
backupType: full
backupMode: snapshot
## TikvGCLifeTime specifies the safe gc life time for Backup.
## The time limit during which data is retained for each GC, in the format of Go Duration.
## When a GC happens, the current time minus this value is the safe point.
## Defaults to 72h.
tikvGCLifeTime: 72h
s3:
provider: aws
secretName: minio-secret
region: shanghai
bucket: tidbuss
prefix: tidb/s3
endpoint: http://192.168.1.2:9000
## StorageSize is the PV size specified for the backup operation.
## This value must be greater than the size of the TidbCluster to be backed up.
## Defaults to 100Gi.
storageSize: "100Gi"
## BR configuration.
## Ref: https://docs.pingcap.com/tidb/stable/backup-and-restore-tool
br:
## Cluster specifies name of TidbCluster to be backed up.
cluster: "alpha"
## Namespace specifies namespace of TidbCluster to be backed up.
clusterNamespace: "tidb-admin"
## LogLevel is the log level. Defaults to `info`.
# logLevel: "info"
## StatusAddr is the HTTP listening address for the status report service. Defaults to empty.
# statusAddr: ""
## Concurrency is the size of thread pool on each node that execute the backup task.
## Defaults to 4.
concurrency: 4
## RateLimit is the rate limit of the backup task, MB/s per node.
## If set to 4, the speed limit is 4 MB/s. The speed limit is not set by default.
# rateLimit: 0
## TimeAgo presents back up the data before `timeAgo`, e.g. 1m, 1h. Defaults to empty.
# timeAgo: 1m
## Checksum specifies whether to verify the files after the backup is completed.
## Defaults to `true`.
# checksum: true
## CheckRequirements specifies whether to check requirements before backup
# checkRequirements: true
## SendCredToTikv specifies whether the BR process passes its AWS or GCP privileges to the TiKV process.
## Defaults to `true`.
sendCredToTikv: true
## OnLine specifies whether online during restore. Defaults to false.
# onLine: false
## Options specifies the extra arguments that BR supports. These options have the highest priority.
# options: []
## ToolImage specifies the tool image used in `Backup`, which supports BR and Dumpling images.
## For examples `spec.toolImage: pingcap/br:v5.2.0` or `spec.toolImage: pingcap/dumpling:v5.2.0`
## For BR image, if it does not contain tag, Pod will use image 'ToolImage:${TiKV_Version}'.
toolImage: pingcap/br:v6.5.5
## ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images.
## If private registry is used, imagePullSecrets may be set.
## You can also set this in service account.
## Ref: https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod
# imagePullSecrets:
# - name: secretName
## TableFilter specifies tables that match the table filter rules for BR or Dumpling.
## Ref: https://docs.pingcap.com/tidb/stable/table-filter
## Defaults to empty.
# tableFilter: []
## Affinity for Backup pod scheduling
## Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
# affinity: {}
## UseKMS to decrypt the secrets. Defaults to false.
useKMS: false
## ServiceAccount Specify service account of Backup.
serviceAccount: "tidb-backup-manager"
## CleanPolicy specifies whether to clean backup data when the Backup CR is deleted, if not set, the backup data will be retained.
## `Retain` represents that the backup data will be retained when the Backup CR is deleted.
## `OnFailure` represents that the backup data will be cleaned only for the failed backups when the Backup CR is deleted.
## `Delete` represents that the backup data will be cleaned when the Backup CR is deleted.
cleanPolicy: Retain
- Delete the previously applied Backup CR
kubectl -n tidb-admin delete -f backup-s3.yaml
- Delete the files under the old backup prefix in MinIO (see the sketch below)
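A sketch of removing the old objects with the mc client (the alias myminio is the same assumption as above):
mc rm --recursive --force myminio/tidbuss/tidb/s3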
Re-run the backup
hbu@Pauls-MacBook-Air backup % kubectl -n tidb-admin apply -f backup-s3.yaml
backup.pingcap.com/backup2s3-dev created
Check the results
hbu@Pauls-MacBook-Air data % kubectl -n tidb-admin get pod
NAME READY STATUS RESTARTS AGE
alpha-discovery-68588cd598-k5m56 1/1 Running 1 (12d ago) 15d
alpha-pd-0 1/1 Running 1 (12d ago) 15d
alpha-tidb-0 2/2 Running 2 (12d ago) 15d
alpha-tikv-0 1/1 Running 1 (12d ago) 15d
backup-backup2s3-dev-l5rq4 0/1 Completed 0 33s
tidb-controller-manager-54694444b9-ncj8z 1/1 Running 6 15d
hbu@Pauls-MacBook-Air data % kubectl -n tidb-admin get backup
NAME TYPE MODE STATUS BACKUPPATH BACKUPSIZE COMMITTS LOGTRUNCATEUNTIL AGE
backup2s3-dev full snapshot Complete s3://tidbuss/tidb/s3 271 kB 445430282047455233 42s
With the modified backup configuration file, the tidb-operator backup to the S3-compatible storage MinIO completes successfully.
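As a cross-check, the backup artifacts can also be listed directly in MinIO (again assuming the mc alias myminio):
mc ls --recursive myminio/tidbuss/tidb/s3/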
Summary
Following the official TiDB Operator documentation, backing up with TiDB Operator to the S3-compatible storage MinIO is relatively straightforward. However, custom development work around TiDB Operator requires developers to understand the relevant fields in greater depth in order to troubleshoot errors effectively.
In addition, AWS S3 and MinIO are, after all, two different products; how MinIO's region is configured and used is another point worth paying attention to during development.