
Hive, Part 5: Usage Tips, Data Queries, Logs, and Complex Types

article 2024/12/23 1:20:19

I. Some Hive usage tips

II. Querying table data

III. Hive's default log

IV. Complex data types

1. Using Array

2. Using the explode function

3. Using Map

4. Struct

I. Some Hive usage tips

1. Run SQL statements without entering the Hive shell

The shell option -e runs a one-off statement and then exits:

hive -e "select * from databaseName.student"

hive -S -e "set" | grep cli.print
-S is silent mode; it suppresses the extra output.

To show the column names above a query result, run:
set hive.cli.print.header=true;

To make this permanent, edit /opt/installs/hive/conf/.hiverc and add:

set hive.cli.print.header=true;

2. Hive can run a SQL file directly

hive -f <path to the SQL file>

First create a SQL file named test.sql:
use databaseName;
insert into student values(1,'cfxj');

Then run it:
hive -f test.sql

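3. Run Linux commands from inside Hive

In the Hive shell, prefix a Linux command with ! (an ASCII exclamation mark) and end it with a semicolon:

! ls /home/hivedata;

4. Operate on HDFS from inside Hive

In the Hive shell you can run HDFS dfs commands directly, without typing the hdfs or hadoop prefix:

dfs -ls /user/hive/warehouse;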

5. Show the current database in the prompt

<property>
    <name>hive.cli.print.current.db</name>
    <value>false</value>
    <description>Whether to include the current database in the Hive prompt.</description>
</property>

Set value to true.
This has the same effect as putting set hive.cli.print.current.db=true; in .hiverc.

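Question: how many ways are there to set a property?

Method 0: the command line

1. When starting hive, add --hiveconf param=value on the command line to set a parameter. Do not put spaces around the equals sign.
2. Test: use a command-line parameter so that hive does not print the current database name:
hive --hiveconf hive.cli.print.current.db=false
Note: a command-line parameter only applies to that one launch of hive.

If .hiverc contains the same setting, .hiverc wins.

Method 1: set it directly in the hive session
set hive.cli.print.current.db=true;
This overrides the setting in .hiverc.
Method 2: set it in .hiverc under hive's conf directory
Method 3: set it in hive-site.xml
Method 4: the default settings

Which one takes effect:
Method 1 > Method 2 > Method 0 > Method 3 > Method 4 (the nearest setting wins)
A setting made in the session overrides every other source. A setting in .hiverc overrides everything that comes before .hiverc, and so on down the chain.

Method 4, the defaults: Hive ships with built-in default values (they are listed in hive-default.xml.template). To change one, write your own hive-site.xml and put in it only the defaults you want to override.

Methods 1 and 2 are really the same thing: both last only for the client session. When hive starts, it automatically loads .hiverc and executes the set statements in it. When the connection is closed, the settings are gone.

hive-site.xml only needs the settings you actually want to change; there is no need to copy the whole default file.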
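
The precedence rule above is a layered lookup: the first layer that defines a property decides its value. The following is a minimal Python sketch of that idea, not Hive's actual implementation; the layer contents are made-up values chosen only to show the ordering.

```python
from collections import ChainMap

# Hypothetical layers, listed from highest to lowest priority.
session_set = {}                                        # Method 1: set ...; in the session
hiverc = {"hive.cli.print.current.db": "true"}          # Method 2: .hiverc
command_line = {"hive.cli.print.current.db": "false"}   # Method 0: --hiveconf
hive_site = {"hive.cli.print.header": "true"}           # Method 3: hive-site.xml
defaults = {"hive.cli.print.current.db": "false",
            "hive.cli.print.header": "false"}           # Method 4: built-in defaults

# ChainMap returns the value from the first mapping that contains the key.
effective = ChainMap(session_set, hiverc, command_line, hive_site, defaults)

db_before = effective["hive.cli.print.current.db"]   # .hiverc beats --hiveconf
header = effective["hive.cli.print.header"]          # hive-site.xml beats the default

# A set statement typed in the session overrides every other layer.
session_set["hive.cli.print.current.db"] = "false"
db_after = effective["hive.cli.print.current.db"]

print(db_before, header, db_after)  # true true false
```

The sketch only models "who wins"; it does not model lifetimes. In Hive the session and .hiverc layers disappear when the client disconnects, while hive-site.xml persists.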

To view a setting in the current session:

set hive.cli.print.current.db;

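6. Enable local mode so that small jobs run faster

-- Enable automatic local mode
set hive.exec.mode.local.auto=true;
-- Local mode is used only when the input size is below this value (in bytes; 134217728 = 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- Local mode is used only when the number of input files is below this value
set hive.exec.mode.local.auto.input.files.max=4;


Do not write comments in .hiverc, or Hive reports an error.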

If a job fails with a "too many open files" error, change the following configuration:

vi /etc/security/limits.conf

Append at the bottom:
root soft nofile 65535
root hard nofile 65535

II. Querying table data

select ..
from ..
	join [tableName] on ..
	where ..
	group by ..
	having ..
	order by ..
	sort by ..
	limit ..
union | union all ...

Execution order:

Step 1: FROM <left_table>
Step 2: ON <join_condition>
Step 3: <join_type> JOIN <right_table>
Step 4: WHERE <where_condition>
Step 5: GROUP BY <group_by_list>
Step 6: HAVING <having_condition>
Step 7: SELECT
Step 8: DISTINCT <select_list>
Step 9: ORDER BY <order_by_condition>
Step 10: LIMIT <limit_number>

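About count:

How count behaves:
1. In terms of the result:
	- count(*) counts rows, so it does not skip NULL values
	- count(1) evaluates the constant 1 for every row, so it also counts every row and does not skip NULL values
	- count(column) looks only at that column and skips the rows where it is NULL

 For example:
 name [column name]
 zhangsan
 null
 lisi

 select count(name)  from xxxx;   == 2
 select count(*)  from xxxx;   == 3
 select count(1)  from xxxx;   == 3

2. In terms of performance: this depends on whether the column is a primary key, which engine is used, and so on, so the rules of thumb below describe databases such as MySQL rather than Hive specifically.
	- If the column is the primary key, count(column) is faster than count(1)
	- If the column is not the primary key, count(1) is faster than count(column)
	- If the table has several columns and no primary key, count(1) is more efficient than count(*)
	- If there is a primary key, count(primary key) is the most efficient
	- If the table has only one column, count(*) is the most efficient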
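
The NULL handling in point 1 can be checked with a few lines of plain Python. This is only a model of the counting rule, with None standing in for NULL and the three-row name column taken from the example above.

```python
# The example column from the text; None stands in for NULL.
names = ["zhangsan", None, "lisi"]

count_star = len(names)                               # count(*): every row
count_one = sum(1 for _ in names)                     # count(1): the constant 1 per row
count_name = sum(1 for n in names if n is not None)   # count(name): non-NULL values only

print(count_name, count_star, count_one)  # 2 3 3
```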

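About limit:

In MySQL, limit can take two arguments: limit [m,] n
    select * from t_user limit 1,3; -- skip 1 row, i.e. start from the second row, and return 3 rows
	 In Hive before 2.0.0, limit takes only one argument: limit n returns the first n rows. Hive 2.0.0 and later also accept the two-argument form limit m,n.
	  In most cases you sort with order by before applying limit.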
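
The two forms of limit behave like a list slice. The sketch below is a plain-Python model under that assumption; the limit helper and the row values are hypothetical and exist only for illustration.

```python
rows = ["row1", "row2", "row3", "row4", "row5"]

def limit(rows, n, offset=0):
    """Model of `limit offset, n`: skip `offset` rows, then return at most n rows."""
    return rows[offset:offset + n]

first_three = limit(rows, 3)             # limit 3
from_second = limit(rows, 3, offset=1)   # limit 1,3

print(first_three)  # ['row1', 'row2', 'row3']
print(from_second)  # ['row2', 'row3', 'row4']
```

Without an order by, the database is free to return rows in any order, which is why the slice is only meaningful on sorted input.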

union | union all：

select sname,sage from student where sex='男'
union
select tname,tage from teacher where tname like '张%';

union: removes duplicate rows from the combined result.
union all: keeps duplicate rows.
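
The difference is easy to see on two small lists of rows. This is a plain-Python model with made-up rows; note that it keeps rows in first-seen order, whereas a real query gives no ordering guarantee unless you add order by.

```python
# Hypothetical (name, age) rows from the two select statements.
students = [("zhang", 20), ("li", 22)]
teachers = [("zhang", 20), ("wang", 35)]

union_all = students + teachers
# dict.fromkeys keeps the first occurrence of each row and drops later duplicates.
union = list(dict.fromkeys(union_all))

print(len(union_all), len(union))  # 4 3
```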

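join:

1. Inner join: [inner] join
2. Outer join (outer join): this introduces the idea of a driving table, whose rows all appear in the result
  -   Left outer join: left [outer] join, the left table is the driving table
  -   Right outer join: right [outer] join, the right table is the driving table
  -   Full outer join: full [outer] join, supported by Hive but not by MySQL. All rows from both tables appear in the result.

For example, select * from a join b on xxxx=xxx;  -- is an inner join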

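[Figure: diagram of the join types]

Joining tables without a join condition produces a Cartesian product.

select * from emp join dept;
If emp has 14 rows
   and dept has 4 rows:  14 * 4 = 56
Most of the rows in a Cartesian product are meaningless, so add a join condition to filter them.

select * from emp join dept on emp.deptno = dept.deptno;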
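
The 14 * 4 = 56 arithmetic, and the effect of adding the on condition, can be reproduced with a short Python model. The rows are invented for the example: 14 employees spread over departments 10, 20 and 30, plus a fourth department with nobody in it.

```python
from itertools import product

# Hypothetical data: 14 employees in departments 10/20/30, and 4 departments.
emp = [{"empno": i, "deptno": 10 * (i % 3 + 1)} for i in range(14)]
dept = [{"deptno": d} for d in (10, 20, 30, 40)]

# join with no condition: every emp row paired with every dept row
cartesian = list(product(emp, dept))

# join ... on emp.deptno = dept.deptno: keep only the matching pairs
joined = [(e, d) for e, d in cartesian if e["deptno"] == d["deptno"]]

print(len(cartesian), len(joined))  # 56 14
```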

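left semi join (a Hive join type that MySQL does not have)

Hive has a special join operation, left semi join. It returns each left-table row that has at least one match in the right table, and only columns of the left table can be selected. It is mainly used to test whether a left-table row exists in the right table, which makes it equivalent to using the exists keyword.

Example: which employees have a manager?
select * from emp A where exists (select 1 from emp B where B.empno = A.mgr );

The second way to write it uses left semi join:
select * from emp A left semi join emp B on A.mgr = B.empno;

Note: Hive does not support right semi join.

Every left outer join can be rewritten as a right outer join by swapping the two tables.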
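
As a rough model of the semi join above: a left row is kept if its key appears anywhere on the right side, and nothing from the right side is returned. The Python sketch below uses invented (empno, mgr) rows, with None for an employee who has no manager.

```python
# Hypothetical (empno, mgr) rows; mgr is None for the top-level employee.
emp = [(1, None), (2, 1), (3, 1), (4, 2)]

# The right side of the semi join only contributes the set of keys to test against.
empnos = {empno for empno, _ in emp}

# left semi join / exists: keep a left row when its mgr matches some empno.
has_manager = [row for row in emp if row[1] in empnos]

print(has_manager)  # [(2, 1), (3, 1), (4, 2)]
```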

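III. Hive's default log

The default location is /tmp/<user name>, which is /tmp/root when Hive runs as root. The log is very helpful when working with Hive.

The log has its own configuration file in Hive's conf directory: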

mv hive-log4j2.properties.template hive-log4j2.properties

property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name}
property.hive.log.file = hive.log

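What is it for?

When a complex SQL statement fails, the Hive session often does not show the cause; you have to read the log to find the specific error and fix it.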

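IV. Complex data types

Array Map Struct

1. Using Array

create table tableName(
......
colName array<primitive_type>
......
)

Note: indexes start at 0. An out-of-range index does not raise an error; it returns NULL instead.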
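
The indexing rule can be modeled in Python as follows. The element_at helper is hypothetical (it is not a Hive function used in this article); it only mimics "NULL instead of an error", with None standing in for NULL.

```python
def element_at(arr, i):
    """Model of colName[i]: return None (NULL) when i is out of range."""
    return arr[i] if 0 <= i < len(arr) else None

scores = [23, 12]  # a made-up array with two elements

first = element_at(scores, 0)    # index 0 is the first element
missing = element_at(scores, 3)  # out of range: NULL, not an error

print(first, missing)  # 23 None
```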

zhangsan	78,89,92,96
lisi	67,75,83,94
王五	23,12

Create the table:

create table arr1(
  name string,
  scores array<int>
)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

Load the data:

load data local inpath '/home/hivedata/arr1.txt' into table arr1;

hive (databaseName)> select * from arr1;
OK
arr1.name       arr1.scores
zhangsan        [78,89,92,96]
lisi    [67,75,83,94]
王五    [23,12]
Time taken: 0.32 seconds, Fetched: 3 row(s)

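Requirements:

1. Get each student's first score
select name,scores[0] from arr1;
name    _c1
zhangsan        78
lisi    67
王五    23
2. Get the second score of the students who have at least three scores
select name,scores[1] from arr1 where size(scores) >=3;

3. Get every student's total score
select name,scores[0]+scores[1]+nvl(scores[2],0)+nvl(scores[3],0) from arr1;

This approach is limited: you do not know in advance how many scores there are, and even if you did, spelling out every index is clumsy.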

2. Using the explode function

Why learn this? Because we want to turn the data into the following shape:

zhangsan        78
zhangsan        89
zhangsan        92
zhangsan        96
lisi	67
lisi	75
lisi	83
lisi	94
王五	23
王五	12

explode exists specifically to expand ("explode") a collection into one row per element.

select explode(scores) from arr1;

col
78
89
92
96
67
75
83
94
23
12

It is tempting to assume that simply adding name will work. It does not:
hive (databaseName)> select name,explode(scores) from arr1;
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

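-- lateral view: a virtual table.

It puts the rows produced by a UDTF into a virtual table, and that virtual table is then joined with the input row, so the original columns can be selected next to the exploded values.

Usage:

select name,cj from arr1 lateral view explode(scores) mytable as cj;

Explanation:
lateral view explode(scores) produces a virtual table, and you have to name it yourself.
Give one alias for each column it produces; after that it behaves like any other virtual table.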

name    cj
zhangsan        78
zhangsan        89
zhangsan        92
zhangsan        96
lisi    67
lisi    75
lisi    83
lisi    94
王五    23
王五    12

select name,sum(cj) from arr1 lateral view explode(scores) mytable as cj group by name;
This is equivalent to:
select name,sum(score) from
   (select name,score from arr1 lateral view explode(scores) myscore as score ) t group by name;
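
To make the mechanics concrete, here is a plain-Python model of lateral view explode followed by group by and sum, using the three arr1 rows shown earlier. It illustrates the row-multiplying behavior only; it is not how Hive executes the query.

```python
# The arr1 rows from the text: (name, scores).
arr1 = [("zhangsan", [78, 89, 92, 96]),
        ("lisi", [67, 75, 83, 94]),
        ("王五", [23, 12])]

# lateral view explode(scores): one output row per (input row, array element).
exploded = [(name, cj) for name, scores in arr1 for cj in scores]

# group by name with sum(cj)
totals = {}
for name, cj in exploded:
    totals[name] = totals.get(name, 0) + cj

print(len(exploded))  # 10
print(totals)         # {'zhangsan': 355, 'lisi': 319, '王五': 35}
```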

Requirement 4: get each person's last score
select name,scores[size(scores)-1] from arr1;

3. Using Map

Syntax:

create table tableName(
.......
colName map<T,T>
......
)

An example:

zhangsan	chinese:90,math:87,english:63,nature:76
lisi	chinese:60,math:30,english:78,nature:0
wangwu	chinese:89,math:25

Create the table:

create table map1(
  name string,
  scores map<string,int>
)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

Load the data:

load data local inpath '/home/hivedata/map1.txt' into table map1;
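
The three delimiters in the table definition decide how each line of map1.txt is split. The Python sketch below models that splitting for one line; parse_line is a hypothetical helper written for this illustration, and it assumes well-formed input.

```python
def parse_line(line):
    """Split one text line the way the map1 row format describes."""
    name, raw = line.split("\t")          # fields terminated by '\t'
    scores = {}
    for item in raw.split(","):           # collection items terminated by ','
        key, value = item.split(":")      # map keys terminated by ':'
        scores[key] = int(value)          # the column is map<string,int>
    return name, scores

parsed = parse_line("wangwu\tchinese:89,math:25")
print(parsed)  # ('wangwu', {'chinese': 89, 'math': 25})
```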

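Requirements:

Requirement 1:
-- English and nature scores of the students whose math score is above 35
select name,scores['english'],scores['nature'] from map1
where scores['math'] > 35;

Requirement 2: -- the sum of each person's first two subjects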
select name,scores['chinese']+scores['math'] from map1;

OK
name    _c1
zhangsan        177
lisi    90
wangwu  114
Time taken: 0.272 seconds, Fetched: 3 row(s)

Requirement 3: display the data as:
-- expanded form
zhangsan	chinese		90
zhangsan	math	87
zhangsan	english 	63
zhangsan	nature		76

select name,subject,cj   from map1 lateral view explode(scores) mytable as subject,cj ;

name    subject cj
zhangsan        chinese 90
zhangsan        math    87
zhangsan        english 63
zhangsan        nature  76
lisi    chinese 60
lisi    math    30
lisi    english 78
lisi    nature  0
wangwu  chinese 89
wangwu  math    25

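Requirement 4: total score per person
select name,sum(cj)   from map1 lateral view explode(scores) mytable as subject,cj  group by name;
To sort by total score in descending order, order by cannot use the lateral view's column alias; use the alias of the aggregated column instead:
select name,sum(score) sumScore from map1 lateral view explode(scores) myscore as subject,score group by name order by sumScore desc;

Requirement 5:
-- Convert data in the following format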
zhangsan        chinese 90
zhangsan        math    87
zhangsan        english 63
zhangsan        nature  76
lisi    chinese 60
lisi    math    30
lisi    english 78
lisi    nature  0
wangwu  chinese 89
wangwu  math    25
wangwu  english 81
wangwu  nature  9
-- into:
zhangsan chinese:90,math:87,english:63,nature:76
lisi chinese:60,math:30,english:78,nature:0
wangwu chinese:89,math:25,english:81,nature:9

Prepare some data (in a new table):

create table map_temp as
select name,subject,cj   from map1 lateral view explode(scores) mytable as subject,cj ;

Step 1: turn each subject and its score into a key-value pair, which is just string concatenation.


How concat works:
hive (databaseName)> select concat('hello','world');
OK
_c0
helloworld
Time taken: 0.333 seconds, Fetched: 1 row(s)
hive (databaseName)> select concat('hello','->','world');
OK
_c0
hello->world
Time taken: 0.347 seconds, Fetched: 1 row(s)

Put it to use:
select name,concat(subject,":",cj) from map_temp;

Result:
name    _c1
zhangsan        chinese:90
zhangsan        math:87
zhangsan        english:63
zhangsan        nature:76
lisi    chinese:60
lisi    math:30
lisi    english:78
lisi    nature:0
wangwu  chinese:89
wangwu  math:25

Now merge these rows per person:
select name,collect_set(concat(subject,":",cj)) from map_temp
group by name;

lisi    ["nature:0","english:78","math:30","chinese:60"]
wangwu  ["math:25","chinese:89"]
zhangsan        ["nature:76","english:63","math:87","chinese:90"]
Join the elements of the collection with commas:
select name,concat_ws(",",collect_set(concat(subject,":",cj))) from map_temp group by name;

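Result (this is the target format from Requirement 5; with the map_temp table built above, wangwu has only the chinese and math entries, and the order of the elements within a row follows collect_set):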
zhangsan chinese:90,math:87,english:63,nature:76
lisi chinese:60,math:30,english:78,nature:0
wangwu chinese:89,math:25,english:81,nature:9



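Functions learned here:
concat: concatenates strings.
collect_set(): turns the values of a group into a set; its elements are not repeated.
collect_list(): like collect_set(), but duplicates are kept.
concat_ws(separator, collection): joins all elements of the collection into one string using the separator.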
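
The three steps (concat, then collect_set per group, then concat_ws) can be traced in a few lines of Python. This is a model of the data flow on a small made-up subset of map_temp; unlike this sketch, Hive's collect_set does not promise any particular element order.

```python
# A small, made-up subset of map_temp: (name, subject, cj).
map_temp = [("wangwu", "chinese", 89), ("wangwu", "math", 25),
            ("lisi", "chinese", 60), ("lisi", "math", 30)]

groups = {}
for name, subject, cj in map_temp:
    pair = f"{subject}:{cj}"                 # concat(subject, ":", cj)
    bucket = groups.setdefault(name, [])     # group by name
    if pair not in bucket:                   # collect_set: no repeated elements
        bucket.append(pair)

# concat_ws(",", ...): join each group's elements with a comma
result = {name: ",".join(pairs) for name, pairs in groups.items()}

print(result)  # {'wangwu': 'chinese:89,math:25', 'lisi': 'chinese:60,math:30'}
```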

Next, we want to turn the data into:

lisi    {"chinese":"60","math":"30","english":"78","nature":"0"}
wangwu  {"chinese":"89","math":"25"}
zhangsan        {"chinese":"90","math":"87","english":"63","nature":"76"}

Requirement: use the str_to_map function to turn a string into a map

select name,str_to_map(concat_ws(",",collect_set(concat(subject,":",cj)))) from map_temp group by name;
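
What str_to_map does to one concatenated string can be sketched in Python as below. The function is a stand-in written for this illustration, assuming "," between pairs and ":" between key and value as in the query above; the values stay strings, which matches the quoted values in the target output.

```python
def str_to_map(text, pair_delim=",", kv_delim=":"):
    """Model of str_to_map: split into pairs, then split each pair into key and value."""
    return dict(pair.split(kv_delim, 1) for pair in text.split(pair_delim))

scores = str_to_map("chinese:89,math:25")
print(scores)  # {'chinese': '89', 'math': '25'}
```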

4. Struct

create table tableName(
........
colName struct<subName1:Type,subName2:Type,........>
........
)

A struct is somewhat like a Java class.
To read a field, use a dot:
colName.subName

Prepare the data:

zhangsan	90,87,63,76
lisi	60,30,78,0
wangwu	89,25,81,9

Create the table:

create table if not exists struct1(
name string,
score struct<chinese:int,math:int,english:int,natrue:int>
)
row format delimited 
fields terminated by '\t'
collection items terminated by ',';

Load the data:

load data local inpath '/home/hivedata/struct1.txt' into table struct1;

View the data; it looks a bit like a map:

hive (databaseName)> select * from struct1;
OK
struct1.name    struct1.score
zhangsan        {"chinese":90,"math":87,"english":63,"natrue":76}
lisi    {"chinese":60,"math":30,"english":78,"natrue":0}
wangwu  {"chinese":89,"math":25,"english":81,"natrue":9}
Time taken: 0.272 seconds, Fetched: 3 row(s)

Get the English and Chinese scores of the students whose math score is above 35:

select name, score.english,score.chinese from struct1 
 where score.math > 35;

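 This looks a lot like a map, so I assumed that a map could also be accessed as xxx.xxx,
 or that a struct could also be accessed with [].
I tried both: neither works.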


http://www.kler.cn/a/447241.html