基于Docker的Spark分布式集群
目录
1. 说明
2. 服务器规划
3. 步骤
3.1 要点
3.2 配置文件
3.2 访问Spark Master
4. 使用测试
5. 参考
1. 说明
- 以docker容器方式实现apache spark计算集群,能灵活的增减配置与worker数目。
2. 服务器规划
服务器 (1master, 3workers) | ip | 开放端口 | 备注 |
---|---|---|---|
center01.dev.sb | 172.16.20.20 | 8080,7077 | 硬件配置:32核64G 软件配置:ubuntu22.04 + 宝塔面板 |
host001.dev.sb | 172.16.20.60 | 8081,7077 | 8核16G |
host002.dev.sb | 172.16.20.61 | 8081,7077 | ... |
BEN-ZX-GZ-MH | 172.16.1.106 | 应用服务,发任务机器 |
3. 步骤
3.1 要点
- worker节点的网络模式用host,不然spark ui页面中获取的路径会是容器ip,里面的链接变得不可访问
- 测试前需保证任务发布机与Worker机的运行语言版本一致(如: 同是python10 / python12),否则会报错 "Python in worker has different version (3, 12) than that in driver 3.10"。
- 确保发任务机器能被Worker节点访问,否则会出现诸如:
"WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"
等莫名其妙的错误,观察工作机错误日志:
"Caused by: java.io.IOException: Failed to connect to BEN-ZX-GZ-MH/<unresolved>:10027"
由于访问不了发任务机器而导致的,目前采取的解决方法是在配置里写死映射IP
3.2 配置文件
docker-compose.spark-master.yml
services:
spark:
image: docker.io/bitnami/spark:latest
container_name: spark-master
restart: always
environment:
- SPARK_MODE=master
- SPARK_RPC_AUTHENTICATION_ENABLED=no
- SPARK_RPC_ENCRYPTION_ENABLED=no
- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
- SPARK_SSL_ENABLED=no
- SPARK_USER=spark
ports:
- '8080:8080'
- '7077:7077'
docker-compose.spark-worker.yml
services:
spark-worker:
image: docker.io/bitnami/spark:latest
container_name: spark-worker
restart: always
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark-master:7077
- SPARK_WORKER_MEMORY=2G
- SPARK_WORKER_CORES=2
- SPARK_RPC_AUTHENTICATION_ENABLED=no
- SPARK_RPC_ENCRYPTION_ENABLED=no
- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
- SPARK_SSL_ENABLED=no
- SPARK_USER=spark
ports:
- '8081:8081'
- '7077:7077'
extra_hosts:
- "spark-master:172.16.20.20"
- "BEN-ZX-GZ-MH:172.16.1.106"
network_mode: host
3.2 访问Spark Master
访问Spark Master,可见已有两台worker机可供驱使
4. 使用测试
t3.py
from pyspark.sql import SparkSession
def main():
# Initialize SparkSession
spark = (
SparkSession.builder.appName("HelloSpark") # type: ignore
.master("spark://center01.dev.sb:7077")
.config("spark.executor.memory", "512m")
.config("spark.cores.max", "1")
# .config("spark.driver.bindAddress", "center01.dev.sb")
.getOrCreate()
)
# Create an RDD containing numbers from 1 to 10
numbers_rdd = spark.sparkContext.parallelize(range(1, 11))
# Count the elements in the RDD
count = numbers_rdd.count()
print(f"Count of numbers from 1 to 10 is: {count}")
# Stop the SparkSession
spark.stop()
if __name__ == "__main__":
main()
运行监控
结果
5. 参考
- containers/bitnami/spark at main · bitnami/containers · GitHub