当前位置: 首页 > article >正文

Hive:数据仓库利器

1. 简介

Hive是一个基于Hadoop的开源数据仓库工具,可以用来存储、查询和分析大规模数据。Hive使用SQL-like的HiveQL语言来查询数据,并将其结果存储在Hadoop的文件系统中。

2. 基本概念

介绍 Hive 的核心概念,例如表、分区、桶、HQL 等。

2.1 架构

Design - Apache Hive - Apache Software Foundation
组成详情
UIThe user interface for users to submit queries and other operations to the system. As of 2011 the system had a command line interface and a web based GUI was being developed.
DriverThe component which receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
CompilerThe component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
MetastoreThe component that stores all the structure information of the various tables and partitions in the warehouse including column and column type information, the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored.
Execution EngineThe component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components.

2.2 Data Model

类型详情
TablesThese are analogous to Tables in Relational Databases. Tables can be filtered, projected, joined and unioned. Additionally all the data of a table is stored in a directory in HDFS. Hive also supports the notion of external tables wherein a table can be created on prexisting files or directories in HDFS by providing the appropriate location to the table creation DDL. The rows in a table are organized into typed columns similar to Relational Databases.
PartitionsEach Table can have one or more partition keys which determine how the data is stored, for example a table T with a date partition column ds had files with data for a particular date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow the system to prune data to be inspected based on query predicates, for example a query that is interested in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in <table location>/ds=2008-09-01/ directory in HDFS.
BucketsData in each partition may in turn be divided into Buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table).

3. 实践应用

3.1 数仓建设

4. 性能优化

介绍如何优化 Hive 的性能

5. 常见问题解答

5.1 常用SQL

场景SQL
连续n天登录
SELECT * FROM test;

6. 总结

总结 Hive 的关键知识点,并提供学习资源和进一步研究方向。


http://www.kler.cn/a/273512.html

相关文章:

  • 【测试工具】Fastbot 客户端稳定性测试
  • python NLTK快速入门
  • 每日一言 动态图片
  • 相机硬触发
  • Swift 开发教程系列 - 第3章:控制流
  • 2、liunx网络基础
  • 关系数据库标准语言SQL
  • 链表练习1
  • Ubuntu软件开发环境搭建
  • 深入理解 C# Unity 中的事件和委托
  • 苍穹外卖-day13:vue基础回顾+进阶
  • qt开发记录
  • idea远程试调jar、远程试调war
  • 智能合约 - 部署ERC20
  • Visual Studio 常用快捷键
  • C++进阶之路---手撕“红黑树”
  • ZnO 阀片的非线性 U-I特性
  • 基于时空上下文(STC)的运动目标跟踪算法,Matlab实现
  • cf火线罗技鼠标宏最细教程(鬼跳,上箱,一键顺,usp速点,雷神三连发及压枪,AK火麒麟压枪.lua脚本)
  • springboot整合springsecurity,从数据库中认证
  • 小程序搜索排名优化二三事
  • 数据结构——lesson10排序之插入排序
  • 配置视图解析器
  • Tomcat:Session ID保持会话
  • DockerFile遇到的坑
  • Linux:Gitlab:16.9.2 (rpm包) 部署及基础操作(1)