Resolving corrupted partitions in Spark SQL

Updated: 2026-04-04 05:42:51

Published December 6, 2023


Querying a partitioned table with Spark

spark-sql -e "
SELECT *
FROM td_fixed_http_flow
WHERE dt = '2018-12-02'
  AND HOUR = '16';"

The query fails with an exception:

Caused by: java.io.FileNotFoundException: File hdfs://rzx121:8020/apps/hive/warehouse/td_fixed_http_flow/dt=2018-11-17/hour=17 does not exist.
at butedFileSystem$DirListingIterator.(:1081)
at butedFileSystem$DirListingIterator.(:1059)
at butedFileSystem$(:1004)
at butedFileSystem$(:1000)
at e(:81)
at catedStatus(:1018)
at catedStatus(:1736)
at catedStatus(:668)
at dState(:389)
at utFormat$ternal(:672)
at utFormat$$600(:640)
at utFormat$FileGenerator$(:662)
at utFormat$FileGenerator$(:659)
at ileged(Native Method)
at (:422)
at (:1869)
at utFormat$(:659)
at utFormat$(:640)
at (:266)
at ker(:1142)
at PoolExecutor$(:617)
at (:745)

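Cause: historical HDFS data is purged periodically, and the following command had been run:

hadoop fs -rmr /apps/hive/warehouse/td_fixed_http_flow_hour/dt=2018-11-17/hour=17

This deletes the directory on HDFS but leaves the partition registered in the Hive metastore.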

The same query does not fail when it is run through Hive:

hive -e "
SELECT *
FROM td_fixed_http_flow
WHERE dt = '2018-12-02'
  AND HOUR = '16';"

No error is reported: Hive does not try to read partition directories that no longer exist on HDFS.

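So why does Spark SQL fail? The root cause is that Spark loads a Hive partitioned table according to the partitions listed by show partitions, and it raises an error as soon as one of those directories is missing.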
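The mismatch between registered partitions and existing directories can be checked before a query is run. The following is a minimal Python sketch, not part of the original write-up: it takes partition specs in the form printed by show partitions and reports those whose directory is missing. The names `partition_dir`, `find_missing_partitions`, the `path_exists` callback, and the sample listing are all assumptions made for illustration; in practice the callback might wrap `hadoop fs -test -d`.

```python
from typing import Callable, Iterable, List


def partition_dir(table_root: str, spec: str) -> str:
    """Join a table root and a partition spec such as 'dt=2018-11-17/hour=17'."""
    return table_root.rstrip("/") + "/" + spec.strip("/")


def find_missing_partitions(
    table_root: str,
    specs: Iterable[str],
    path_exists: Callable[[str], bool],
) -> List[str]:
    """Return the partition specs whose directory does not exist."""
    return [s for s in specs if not path_exists(partition_dir(table_root, s))]


# Hypothetical data: two partitions are registered in the metastore,
# but only one of them still has a directory on HDFS.
root = "hdfs://rzx121:8020/apps/hive/warehouse/td_fixed_http_flow"
registered = ["dt=2018-11-17/hour=17", "dt=2018-12-02/hour=16"]
on_disk = {root + "/dt=2018-12-02/hour=16"}

missing = find_missing_partitions(root, registered, lambda p: p in on_disk)
print(missing)  # ['dt=2018-11-17/hour=17']
```

Passing the existence check in as a callback keeps the sketch runnable without a cluster; any partition it reports is a candidate for the fixes that follow.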

Solutions:

1. Exclude the corrupted partition's data from the computation (this is how Hive handles it)

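Set the following in the Spark program:

spark.sql.hive.verifyPartitionPath=true;

This tells Spark to skip the corrupted partition. Alternatively, set the same property in the Spark configuration file.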

2. Rebuild the partition with overwrite, or drop it. The drop approach is shown here:

alter table td_fixed_http_flow_hour drop partition (dt='2018-11-17',hour='17');
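Both options can be scripted. The sketch below is an illustration only, written on two assumptions: that the setting in option 1 is Spark's `spark.sql.hive.verifyPartitionPath` property, passed here through the `--conf` option of spark-sql, and that partition values are plain strings without quote characters. The helper names `spark_sql_command` and `drop_partition_sql` are hypothetical, and `IF EXISTS` is added so that a repeated run does not fail.

```python
from typing import List


def spark_sql_command(query: str, verify_partition_path: bool = True) -> List[str]:
    """Build a spark-sql argument list that sets spark.sql.hive.verifyPartitionPath."""
    flag = "true" if verify_partition_path else "false"
    return [
        "spark-sql",
        "--conf", "spark.sql.hive.verifyPartitionPath=" + flag,
        "-e", query,
    ]


def drop_partition_sql(table: str, spec: str) -> str:
    """Turn 'dt=2018-11-17/hour=17' into an ALTER TABLE ... DROP PARTITION statement."""
    pairs = []
    for part in spec.strip("/").split("/"):
        key, value = part.split("=", 1)
        # Assumption: the value contains no single quotes, so no escaping is done.
        pairs.append("{}='{}'".format(key, value))
    return "ALTER TABLE {} DROP IF EXISTS PARTITION ({});".format(table, ", ".join(pairs))


query = "SELECT * FROM td_fixed_http_flow WHERE dt = '2018-12-02' AND hour = '16';"
print(spark_sql_command(query))
print(drop_partition_sql("td_fixed_http_flow_hour", "dt=2018-11-17/hour=17"))
```

The argument list could be handed to `subprocess.run`, and the generated statement to `hive -e`. Dropping the partition at the time the directory is purged keeps the metastore and HDFS in step, so the query never sees a stale entry.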

