
Merging Small Parquet Files

Merging small Parquet files. One of the small files in the directory to be merged:
./2024-7-26/0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq

hadoop jar ./parquet-tools-1.9.0.jar --help
WARNING: Use "yarn jar" to launch YARN applications.
usage: parquet-tools cat [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
    -j,--json      Show records in JSON format.
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools head [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
    -n,--records   The number of records to show (default: 5)
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools schema [option...] <input>
where option is one of:
    -d,--detailed  Show detailed information about the schema.
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file containing the schema to show

usage: parquet-tools meta [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools dump [option...] <input>
where option is one of:
    -c,--column        Dump only the given column, can be specified more than once
    -d,--disable-data  Do not dump column data
       --debug         Enable debug output
    -h,--help          Show this help string
    -m,--disable-meta  Do not dump row group and page metadata
    -n,--disable-crop  Do not crop the output based on console width
       --no-color      Disable color output even if supported
where <input> is the parquet file to print to stdout

usage: parquet-tools merge [option...] <input> [<input> ...] <output>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the source parquet files/directory to be merged
   <output> is the destination parquet file
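
Every command in this post repeats the same "hadoop jar ./parquet-tools-1.9.0.jar" prefix, and the warning in the help output suggests "yarn jar" as the launcher. A small wrapper can make both easier to manage. This is only a sketch: the names PQ_LAUNCHER, PQ_JAR and pq are made up for illustration and are not part of parquet-tools, and whether "yarn jar" behaves identically on your cluster is an assumption to verify.

```shell
#!/bin/sh
# Sketch only: PQ_LAUNCHER, PQ_JAR and pq are hypothetical names.
PQ_LAUNCHER=${PQ_LAUNCHER:-hadoop}            # set to "yarn" to follow the warning
PQ_JAR=${PQ_JAR:-./parquet-tools-1.9.0.jar}   # jar path used in this post

# Forward a subcommand and its arguments to the parquet-tools jar.
pq() {
    "$PQ_LAUNCHER" jar "$PQ_JAR" "$@"
}

# Examples (not executed here):
#   pq schema ./some-file.parq
#   pq head -n 10 ./some-file.parq
```

With the wrapper sourced, "pq merge <input> <output>" expands to the same command line used later in this post.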

View the schema:
hadoop jar ./parquet-tools-1.9.0.jar schema ./0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq
message schema {
  optional binary id;
  optional binary sn;
  optional binary mes_sn;
  optional binary line_code;
  optional binary section_code;
  optional binary station_code;
  optional binary station_slot;
  optional binary test_software_version;
  optional binary test_time;
  optional double elapsed_time;
  optional binary test_result;
  optional binary failitem;
  optional binary failitems;
  optional binary bg;
  optional binary bu;
  optional binary project_code;
  optional binary project_name;
}
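
Since a merge writes all inputs into one file with a single schema, it may be worth confirming beforehand that the small files really share the schema shown above. The sketch below compares the "schema" output of several files. The function name same_schema is hypothetical, and it assumes the jar sits at ./parquet-tools-1.9.0.jar as in the rest of this post.

```shell
#!/bin/sh
# Sketch only: same_schema is a hypothetical helper, not a parquet-tools command.
# Prints "same" if every given file reports the same schema as the first one;
# otherwise prints the first differing file and returns 1.
same_schema() {
    first=$(hadoop jar ./parquet-tools-1.9.0.jar schema "$1") || return 2
    for f in "$@"; do
        current=$(hadoop jar ./parquet-tools-1.9.0.jar schema "$f") || return 2
        if [ "$current" != "$first" ]; then
            echo "schema differs: $f"
            return 1
        fi
    done
    echo "same"
}

# Example (not executed here):
#   same_schema ./2024-7-26/file-a.parq ./2024-7-26/file-b.parq
```

Each file costs one JVM launch, so for a large directory it is probably enough to check a sample of files rather than all of them.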

View the contents (first 10 records):
hadoop jar ./parquet-tools-1.9.0.jar head -n 10 ./0049b78b48b65d63-7ec94dbc00000028_383261519_data.0.parq

Merge the small Parquet files. The original files are not deleted; a new merged file is created:
hadoop jar ./parquet-tools-1.9.0.jar merge ./2024-7-26/ /tmp/all.parquet
Merge result:
hdfs dfs -du -h /tmp/all.parquet
280.6 M 841.7 M /tmp/all.parquet
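
In the three-column form of "hdfs dfs -du" shown here, the first value is the file size and the second is the disk space consumed across all replicas, so the merged file is about 280.6 MB and occupies about 841.7 MB on the cluster. As a rough check, the ratio of the two gives the replication factor. The helper name replication_factor is hypothetical, and the sketch assumes both sizes are reported in the same unit.

```shell
#!/bin/sh
# Sketch only: replication_factor is a hypothetical helper.
# Input: one line of `hdfs dfs -du -h` output in the form
#   "<size> <unit> <space consumed> <unit> <path>"
# Assumes both sizes use the same unit (both "M" in the line above).
replication_factor() {
    echo "$1" | awk '{ printf "%.1f\n", $3 / $1 }'
}

replication_factor "280.6 M 841.7 M /tmp/all.parquet"   # prints 3.0
```

A result of 3.0 matches the common HDFS default of three replicas.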


