ETL: Tuning Hive SQL (left join)
Our company uses Hadoop to build its data warehouse, so using Hive SQL is unavoidable. During ETL, speed becomes a problem you cannot dodge. I have had joins across a few tables run for an hour. You may think that does not matter, but several ETL runs then take several hours, which wastes a great deal of time. Hive SQL optimization is therefore unavoidable.

Note: this article only covers everyday points to watch at the SQL level. It does not touch the Hadoop or MapReduce layers. For how Hive compiles a query, see the article: [link missing in source]

II. Preparing the data

Suppose we have two tables.

Sight table: sight, about 120,000 records. Table structure:

    hive> desc sight;
    OK
    area      string    None
    city      string    None
    country   string    None
    county    string    None
    id        string    None
    name      string    None
    region    string    None

Sight order detail table: order_sight, about 10.4 million records. Table structure:

    hive> desc order_sight;
    OK
    create_time   string    None
    id            string    None
    order_id      string    None
    sight_id      bigint    None

III. Analysis

3.1 The where condition

Suppose we want every order id for sight id 9718 on 2015-10-10. The SQL would be written like this:

    hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_1434099279301_3562174, Tracking URL = :9981/proxy/application_1434099279301_3562174/
    Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3562174
    Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
    [Stage-1 map/reduce progress lines from 2015-10-12 22:58:22 to 22:59:09 omitted]
    MapReduce Total cumulative CPU time: 52 seconds 790 msec
    Ended Job = job_1434099279301_3562174
    MapReduce Jobs Launched:
    Job 0: Map: 8  Reduce: 1  Cumulative CPU: 52.79 sec  HDFS Read: 371210469  HDFS Write: 330  SUCCESS
    Total MapReduce CPU Time Spent: 52 seconds 790 msec
    OK
    9718    210294949
    9718    210294421
    9718    210296438
    9718    210295344
    9718    210297567
    9718    210296076
    9718    210295525
    9718    210298219
    9718    210295840
    9718    210301363
    9718    210297733
    9718    210298066
    9718    210295239
    9718    210298328
    9718    210298008
    9718    210299712
    9718    210295586
    9718    210295050
    9718    210295566
    9718    210299105
    9718    210296318
    9718    210295277
    Time taken: 52.068 seconds, Fetched: 22 row(s)

As you can see, this takes 52 seconds. Now let us write the SQL a different way:

    hive> select s.id,o.order_id from sight s left join (select order_id,sight_id from order_sight where create_time = '2015-10-10') o on o.sight_id=s.id where s.id=9718;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_1434099279301_3562218, Tracking URL = :9981/proxy/application_1434099279301_3562218/
    Kill Command = /home/q/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1434099279301_3562218
    Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
    [Stage-1 map/reduce progress lines from 2015-10-12 23:03:54 to 23:04:31 omitted]
    MapReduce Total cumulative CPU time: 34 seconds 350 msec
    Ended Job = job_1434099279301_3562218
    MapReduce Jobs Launched:
    Job 0: Map: 8  Reduce: 1  Cumulative CPU: 34.35 sec  HDFS Read: 371210469  HDFS Write: 330  SUCCESS
    Total MapReduce CPU Time Spent: 34 seconds 350 msec
    OK
    9718    210297733
    9718    210298066
    9718    210295239
    9718    210298328
    9718    210298008
    9718    210299712
    9718    210297567
    9718    210296076
    9718    210295525
    9718    210298219
    9718    210295840
    9718    210301363
    9718    210295586
    9718    210295050
    9718    210295566
    9718    210299105
    9718    210296318
    9718    210295277
    9718    210294949
    9718    210294421
    9718    210296438
    9718    210295344
    Time taken: 43.709 seconds, Fetched: 22 row(s)

This took 43 seconds, about 16% less wall-clock time, and cumulative CPU time dropped from 52.79 to 34.35 seconds. The point is not the exact percentage (I ran the test several times, and this run showed the smallest gap) but the reason behind it.

The difference is visible from how the two statements are written, especially the subquery in the second one: there I moved the condition on the left-joined table inside the join. The execution changes accordingly, and the Reduce time of the second statement is clearly shorter than that of the first.

The reason is that both statements are broken down into 8 Map tasks and 1 Reduce task. If the condition on the left-joined table is written after the join, the join work on the unfiltered rows is done in the Reduce stage. A single Reduce task handling all of that work necessarily takes longer than 8 Map tasks filtering in parallel, so the execution time becomes very long.

Conclusion: when using an outer join, if the filter on the secondary table is written in the WHERE clause, the full tables are joined first and the filter is applied only afterwards.
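Moving the filter into the subquery changes where the work happens, and it can also change which rows come back when a sight has no matching order. The following is a minimal sketch for checking that on toy data before rewriting a production query. It rests on several assumptions: Python's built-in sqlite3 module stands in for Hive (SQLite is not Hive and tells you nothing about MapReduce timing), the tables carry only the columns the queries touch with simplified types, and the sample rows and the helper names build_toy_db and compare are hypothetical.

```python
import sqlite3

# Form 1: filter on the secondary table written in the WHERE clause.
WHERE_FORM = """
SELECT s.id, o.order_id
FROM sight s
LEFT JOIN order_sight o ON o.sight_id = s.id
WHERE s.id = ? AND o.create_time = '2015-10-10'
ORDER BY o.order_id
"""

# Form 2: filter applied inside a subquery before the join.
SUBQUERY_FORM = """
SELECT s.id, o.order_id
FROM sight s
LEFT JOIN (
    SELECT order_id, sight_id
    FROM order_sight
    WHERE create_time = '2015-10-10'
) o ON o.sight_id = s.id
WHERE s.id = ?
ORDER BY o.order_id
"""


def build_toy_db():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    # Column types are simplified (both ids are INTEGER here).
    conn.execute("CREATE TABLE sight (id INTEGER, name TEXT)")
    conn.execute(
        "CREATE TABLE order_sight "
        "(order_id TEXT, sight_id INTEGER, create_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO sight VALUES (?, ?)",
        [(9718, "sight A"), (1, "sight B")],
    )
    conn.executemany(
        "INSERT INTO order_sight VALUES (?, ?, ?)",
        [
            ("o1", 9718, "2015-10-10"),
            ("o2", 9718, "2015-10-10"),
            ("o3", 9718, "2015-10-09"),  # wrong date, filtered out by both forms
            ("o4", 1, "2015-10-09"),     # sight 1 has no order on 2015-10-10
        ],
    )
    return conn


def compare(conn, sight_id):
    """Return (rows from the WHERE form, rows from the subquery form)."""
    where_rows = conn.execute(WHERE_FORM, (sight_id,)).fetchall()
    subquery_rows = conn.execute(SUBQUERY_FORM, (sight_id,)).fetchall()
    return where_rows, subquery_rows


if __name__ == "__main__":
    conn = build_toy_db()
    print(compare(conn, 9718))
    print(compare(conn, 1))
```

With these toy rows, both forms return the same two orders for sight 9718, which matches the identical 22-row results in the article. For sight 1, which has no order on 2015-10-10, the WHERE form returns no rows while the subquery form returns one row with a NULL order_id. The rewrite is therefore a drop-in replacement only when matching orders exist or when the NULL row is acceptable.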
Article URL: https://www.juheyunku.com/jiaob/zh/9863.shtml
