Building Cubes with Spark in Kylin


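Apache Kylin™ is an open-source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop/Spark for extremely large datasets. It was originally developed by eBay Inc. and contributed to the open-source community, and it can query huge Hive tables with sub-second latency.
What follows is a record of the pitfalls hit during a single-node installation: straight to the configuration and the fixes.
Find a clean machine and copy hadoop, hive and hbase over from the existing nodes. The main purpose is to get their configuration files; the related processes do not have to run on the machine that hosts Kylin.
This is a setup built from the open-source releases, not one integrated with HDP or CDH.
Some of the fixes are taken from other blogs.
Official documentation: http://kylin.apache.org/cn/docs/
The problems with building Cubes through MapReduce have been solved as well, so MapReduce builds also work.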

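Versions

java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64
hadoop-2.8.5 [downloaded from the official site]
hbase-1.4.10 [downloaded from the official site]
hive-2.3.5 [downloaded from the official site]
apache-kylin-2.6.3-bin-hbase1x [downloaded from the official site]
spark-2.3.2 [in $KYLIN_HOME/spark, downloaded with $KYLIN_HOME/bin/download-spark.sh]
spark-2.3.2-yarn-shuffle.jar [built from the Source code archive at https://github.com/apache/spark/releases/tag/v2.3.2 (Oracle JDK1.8.0_181, hadoop2.7.3)]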

Environment variables

JAVA_HOME and the like must be set as well.

export HADOOP_HOME=/home/admin/hadoop-2.8.5
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HBASE_HOME=/home/admin/hbase-1.4.10
export PATH="$HBASE_HOME/bin:$HBASE_HOME/sbin:$PATH"
export HIVE_HOME=/home/admin/hive-2.3.5
export PATH="$HIVE_HOME/bin:$HIVE_HOME/sbin:$PATH"
export HCAT_HOME=$HIVE_HOME/hcatalog
export KYLIN_HOME=/home/admin/kylin-2.6.3
export KYLIN_CONF_HOME=$KYLIN_HOME/conf
export CATALINA_HOME=$KYLIN_HOME/tomcat
export PATH=$PATH:$KYLIN_HOME/bin:$CATALINA_HOME/bin
export tomcat_root=$KYLIN_HOME/tomcat
export hive_dependency=$HIVE_HOME/conf:$HIVE_HOME/lib/*:$HCAT_HOME/share/hcatalog/hive-hcatalog-core-2.3.5.jar
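A quick sanity check of these variables can save a failed start later. The following is a minimal sketch, not part of Kylin: the function name check_kylin_env is hypothetical, and the list of variables is simply the set of *_HOME and *_DIR names used above.

```shell
#!/usr/bin/env bash
# Sketch: report which of the variables exported above are unset or point to
# a missing directory. check_kylin_env is a hypothetical helper name.
check_kylin_env() {
  local var dir status=0
  for var in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR HBASE_HOME HIVE_HOME HCAT_HOME KYLIN_HOME; do
    # Indirect lookup: read the value of the variable whose name is in $var.
    eval "dir=\${$var:-}"
    if [ -z "$dir" ]; then
      echo "$var is not set"
      status=1
    elif [ ! -d "$dir" ]; then
      echo "$var points to a missing directory: $dir"
      status=1
    fi
  done
  return "$status"
}

check_kylin_env || echo "fix the variables listed above before starting Kylin"
```

The function prints one line per problem and returns non-zero if anything is wrong, so it can be called at the top of a start script.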

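Downloading Spark and uploading its dependency jars

Since v2.6.1, Kylin no longer ships the Spark binaries. Download Spark separately and point the SPARK_HOME environment variable at the Spark installation directory (this can be left unset; see $KYLIN_HOME/bin/find-spark-dependency.sh for details).

Download Spark with the script [the download ends up in $KYLIN_HOME/spark]: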

$ $KYLIN_HOME/bin/download-spark.sh

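Package the jars that Spark depends on into a single jar and upload it to HDFS. This follows the official documentation; packaging them as a zip works too:

$ jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
$ hadoop fs -mkdir -p /kylin/spark/
$ hadoop fs -put spark-libs.jar /kylin/spark/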

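Then reference the uploaded archive in either $KYLIN_HOME/conf/kylin.properties or $KYLIN_HOME/spark/conf/spark-defaults.conf:

The property is kylin.engine.spark-conf.spark.yarn.archive in the former and spark.yarn.archive in the latter; setting one of the two is enough.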

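Main configuration

$KYLIN_HOME/conf/kylin.properties

Some entries have had their leading # removed even though they are not required; their default values would do as well.
Also, when the environment variables are already set (HADOOP_CONF_DIR, for example), there is no need to configure kylin.env.hadoop-conf-dir.

Note:
The kylin.engine.spark-conf.xxxx entries under SPARK ENGINE CONFIGS (the xxxx part) can just as well be set in spark-defaults.conf under $KYLIN_HOME/spark/conf.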
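To make the correspondence between the two files concrete, here is a small sketch that rewrites kylin.engine.spark-conf.* entries as spark-defaults.conf lines. The function name to_spark_defaults is hypothetical and the sample input is taken from the configuration shown in this article; job-specific prefixes such as kylin.engine.spark-conf-mergedict.* are Kylin-only and are deliberately skipped.

```shell
#!/usr/bin/env bash
# Sketch: read kylin.properties on stdin and print the spark-defaults.conf
# equivalent of every uncommented kylin.engine.spark-conf.<key>=<value> line.
# to_spark_defaults is a hypothetical helper name.
to_spark_defaults() {
  # "<prefix>.<key>=<value>" becomes "<key> <value>"; commented lines and
  # other prefixes do not match the pattern and are dropped by -n.
  sed -n 's/^kylin\.engine\.spark-conf\.\([^=]*\)=\(.*\)$/\1 \2/p'
}

printf '%s\n' \
  'kylin.engine.spark-conf.spark.master=yarn' \
  'kylin.engine.spark-conf.spark.submit.deployMode=cluster' \
  '#kylin.engine.spark.min-partition=1' \
  'kylin.engine.spark-conf-mergedict.spark.executor.memory=2G' |
  to_spark_defaults
```

For the sample input only the first two lines survive, printed as "spark.master yarn" and "spark.submit.deployMode cluster".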

#
#### METADATA | ENV ###
#
## The metadata store in hbase
kylin.metadata.url=kylin_metadata@hbase
#
## metadata cache sync retry times
kylin.metadata.sync-retries=3
#
## Working folder in HDFS, better be qualified absolute path, make sure user has the right permission to this directory
kylin.env.hdfs-working-dir=/kylin
#
## DEV|QA|PROD. DEV will turn on some dev features, QA and PROD has no difference in terms of functions.
kylin.env=QA
#
## kylin zk base path
kylin.env.zookeeper-base-path=/kylin
#
#### SERVER | WEB | RESTCLIENT ###
#
## Kylin server mode, valid value [all, query, job]
kylin.server.mode=all
#
## List of web servers in use, this enables one web server instance to sync up with other servers.
kylin.server.cluster-servers=192.168.100.2:7070
#
## Display timezone on UI,format like[GMT+N or GMT-N]
kylin.web.timezone=GMT+8
#
## Timeout value for the queries submitted through the Web UI, in milliseconds
kylin.web.query-timeout=300000
#
kylin.web.cross-domain-enabled=true
#
##allow user to export query result
kylin.web.export-allow-admin=true
kylin.web.export-allow-other=true
#
## Hide measures in measure list of cube designer, separate by comma
kylin.web.hide-measures=RAW
#
##max connections of one route
#kylin.restclient.connection.default-max-per-route=20
#
##max connections of one rest-client
#kylin.restclient.connection.max-total=200
#
#### PUBLIC CONFIG ###
#kylin.engine.default=2
#kylin.storage.default=2
#kylin.web.hive-limit=20
#kylin.web.help.length=4
#kylin.web.help.0=start|Getting Started|http://kylin.apache.org/docs/tutorial/kylin_sample.html
#kylin.web.help.1=odbc|ODBC Driver|http://kylin.apache.org/docs/tutorial/odbc.html
#kylin.web.help.2=tableau|Tableau Guide|http://kylin.apache.org/docs/tutorial/tableau_91.html
#kylin.web.help.3=onboard|Cube Design Tutorial|http://kylin.apache.org/docs/howto/howto_optimize_cubes.html
#kylin.web.link-streaming-guide=http://kylin.apache.org/
#kylin.htrace.show-gui-trace-toggle=false
#kylin.web.link-hadoop=
#kylin.web.link-diagnostic=
#kylin.web.contact-mail=
#kylin.server.external-acl-provider=
#
#### SOURCE ###
#
## Hive client, valid value [cli, beeline]
#kylin.source.hive.client=cli
#
## Absolute path to beeline shell, can be set to spark beeline instead of the default hive beeline on PATH
#kylin.source.hive.beeline-shell=beeline
#
## Parameters for beeline client, only necessary if hive client is beeline
##kylin.source.hive.beeline-params=-n root --hiveconf hive.security.authorization.sqlstd.confwhitelist.append='mapreduce.job.*|dfs.*' -u jdbc:hive2://localhost:10000
#
## While hive client uses above settings to read hive table metadata,
## table operations can go through a separate SparkSQL command line, given SparkSQL connects to the same Hive metastore.
#kylin.source.hive.enable-sparksql-for-table-ops=false
##kylin.source.hive.sparksql-beeline-shell=/path/to/spark-client/bin/beeline
##kylin.source.hive.sparksql-beeline-params=-n root --hiveconf hive.security.authorization.sqlstd.confwhitelist.append='mapreduce.job.*|dfs.*' -u jdbc:hive2://localhost:10000
#
#kylin.source.hive.keep-flat-table=false
#
## Hive database name for putting the intermediate flat tables
kylin.source.hive.database-for-flat-table=kylin_flat_db
#
## Whether redistribute the intermediate flat table before building
#kylin.source.hive.redistribute-flat-table=true
#
#
#### STORAGE ###
#
## The storage for final cube file in hbase
kylin.storage.url=hbase
#
## The prefix of hbase table
kylin.storage.hbase.table-name-prefix=KYLIN_
#
## The namespace for hbase storage
kylin.storage.hbase.namespace=default
#
## Compression codec for htable, valid value [none, snappy, lzo, gzip, lz4]
kylin.storage.hbase.compression-codec=none
#
## HBase Cluster FileSystem, which serving hbase, format as hdfs://hbase-cluster:8020
## Leave empty if hbase running on same cluster with hive and mapreduce
##kylin.storage.hbase.cluster-fs=
#
## The cut size for hbase region, in GB.
kylin.storage.hbase.region-cut-gb=5
#
## The hfile size of GB, smaller hfile leading to the converting hfile MR has more reducers and be faster.
## Set 0 to disable this optimization.
kylin.storage.hbase.hfile-size-gb=2
#
#kylin.storage.hbase.min-region-count=1
#kylin.storage.hbase.max-region-count=500
#
## Optional information for the owner of kylin platform, it can be your team's email
## Currently it will be attached to each kylin's htable attribute
#kylin.storage.hbase.owner-tag=whoami@kylin.apache.org
#
#kylin.storage.hbase.coprocessor-mem-gb=3
#
## By default kylin can spill query's intermediate results to disks when it's consuming too much memory.
## Set it to false if you want query to abort immediately in such condition.
#kylin.storage.partition.aggr-spill-enabled=true
#
## The maximum number of bytes each coprocessor is allowed to scan.
## To allow arbitrary large scan, you can set it to 0.
#kylin.storage.partition.max-scan-bytes=3221225472
#
## The default coprocessor timeout is (hbase.rpc.timeout * 0.9) / 1000 seconds,
## You can set it to a smaller value. 0 means use default.
## kylin.storage.hbase.coprocessor-timeout-seconds=0
#
## clean real storage after delete operation
## if you want to delete the real storage like htable of deleting segment, you can set it to true
#kylin.storage.clean-after-delete-operation=false
#
#### JOB ###
#
## Max job retry on error, default 0: no retry
kylin.job.retry=0
#
## Max count of concurrent jobs running
kylin.job.max-concurrent-jobs=10
#
## The percentage of the sampling, default 100%
#kylin.job.sampling-percentage=100
#
## If true, will send email notification on job complete
##kylin.job.notification-enabled=true
##kylin.job.notification-mail-enable-starttls=true
##kylin.job.notification-mail-host=smtp.office365.com
##kylin.job.notification-mail-port=587
##kylin.job.notification-mail-username=kylin@example.com
##kylin.job.notification-mail-password=mypassword
##kylin.job.notification-mail-sender=kylin@example.com
#
#
#### ENGINE ###
#
## Time interval to check hadoop job status
kylin.engine.mr.yarn-check-interval-seconds=10
#
#kylin.engine.mr.reduce-input-mb=500
#
#kylin.engine.mr.max-reducer-number=500
#
#kylin.engine.mr.mapper-input-rows=1000000
#
## Enable dictionary building in MR reducer
#kylin.engine.mr.build-dict-in-reducer=true
#
## Number of reducers for fetching UHC column distinct values
#kylin.engine.mr.uhc-reducer-count=3
#
## Whether using an additional step to build UHC dictionary
#kylin.engine.mr.build-uhc-dict-in-additional-step=false
#
#
#### CUBE | DICTIONARY ###
#
#kylin.cube.cuboid-scheduler=org.apache.kylin.cube.cuboid.DefaultCuboidScheduler
#kylin.cube.segment-advisor=org.apache.kylin.cube.CubeSegmentAdvisor
#
## 'auto', 'inmem', 'layer' or 'random' for testing
#kylin.cube.algorithm=layer
#
## A smaller threshold prefers layer, a larger threshold prefers in-mem
#kylin.cube.algorithm.layer-or-inmem-threshold=7
#
## auto use inmem algorithm:
## 1, cube planner optimize job
## 2, no source record
#kylin.cube.algorithm.inmem-auto-optimize=true
#
#kylin.cube.aggrgroup.max-combination=32768
#
#kylin.snapshot.max-mb=300
#
#kylin.cube.cubeplanner.enabled=true
#kylin.cube.cubeplanner.enabled-for-existing-cube=true
#kylin.cube.cubeplanner.expansion-threshold=15.0
#kylin.cube.cubeplanner.recommend-cache-max-size=200
#kylin.cube.cubeplanner.mandatory-rollup-threshold=1000
#kylin.cube.cubeplanner.algorithm-threshold-greedy=8
#kylin.cube.cubeplanner.algorithm-threshold-genetic=23
#
#
#### QUERY ###
#
## Controls the maximum number of bytes a query is allowed to scan storage.
## The default value 0 means no limit.
## The counterpart kylin.storage.partition.max-scan-bytes sets the maximum per coprocessor.
#kylin.query.max-scan-bytes=0
#
#kylin.query.cache-enabled=true
#
## Controls extras properties for Calcite jdbc driver
## all extras properties should undder prefix "kylin.query.calcite.extras-props."
## case sensitive, default: true, to enable case insensitive set it to false
## @see org.apache.calcite.config.CalciteConnectionProperty.CASE_SENSITIVE
#kylin.query.calcite.extras-props.caseSensitive=true
## how to handle unquoted identity, defualt: TO_UPPER, available options: UNCHANGED, TO_UPPER, TO_LOWER
## @see org.apache.calcite.config.CalciteConnectionProperty.UNQUOTED_CASING
#kylin.query.calcite.extras-props.unquotedCasing=TO_UPPER
## quoting method, default: DOUBLE_QUOTE, available options: DOUBLE_QUOTE, BACK_TICK, BRACKET
## @see org.apache.calcite.config.CalciteConnectionProperty.QUOTING
#kylin.query.calcite.extras-props.quoting=DOUBLE_QUOTE
## change SqlConformance from DEFAULT to LENIENT to enable group by ordinal
## @see org.apache.calcite.sql.validate.SqlConformance.SqlConformanceEnum
#kylin.query.calcite.extras-props.conformance=LENIENT
#
## TABLE ACL
#kylin.query.security.table-acl-enabled=true
#
## Usually should not modify this
#kylin.query.interceptors=org.apache.kylin.rest.security.TableInterceptor
#
#kylin.query.escape-default-keyword=false
#
## Usually should not modify this
#kylin.query.transformers=org.apache.kylin.query.util.DefaultQueryTransformer,org.apache.kylin.query.util.KeywordDefaultDirtyHack
#
#### SECURITY ###
#
## Spring security profile, options: testing, ldap, saml
## with "testing" profile, user can use pre-defined name/pwd like KYLIN/ADMIN to login
#kylin.security.profile=testing
#
## Admin roles in LDAP, for ldap and saml
#kylin.security.acl.admin-role=admin
#
## LDAP authentication configuration
#kylin.security.ldap.connection-server=ldap://ldap_server:389
#kylin.security.ldap.connection-username=
#kylin.security.ldap.connection-password=
#
## LDAP user account directory;
#kylin.security.ldap.user-search-base=
#kylin.security.ldap.user-search-pattern=
#kylin.security.ldap.user-group-search-base=
#kylin.security.ldap.user-group-search-filter=(|(member={0})(memberUid={1}))
#
## LDAP service account directory
#kylin.security.ldap.service-search-base=
#kylin.security.ldap.service-search-pattern=
#kylin.security.ldap.service-group-search-base=
#
### SAML configurations for SSO
## SAML IDP metadata file location
#kylin.security.saml.metadata-file=classpath:sso_metadata.xml
#kylin.security.saml.metadata-entity-base-url=https://hostname/kylin
#kylin.security.saml.keystore-file=classpath:samlKeystore.jks
#kylin.security.saml.context-scheme=https
#kylin.security.saml.context-server-name=hostname
#kylin.security.saml.context-server-port=443
#kylin.security.saml.context-path=/kylin
#
#### SPARK ENGINE CONFIGS ###
#
## Hadoop conf folder, will export this as "HADOOP_CONF_DIR" to run spark-submit
## This must contain site xmls of core, yarn, hive, and hbase in one folder
# kylin.env.hadoop-conf-dir=/home/admin/hadoop-2.8.5/etc/hadoop
#
## Estimate the RDD partition numbers
kylin.engine.spark.rdd-partition-cut-mb=100
#
## Minimal partition numbers of rdd
#kylin.engine.spark.min-partition=1
#
## Max partition numbers of rdd
#kylin.engine.spark.max-partition=5000
#
## Spark conf (default is in spark/conf/spark-defaults.conf)
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=1G
kylin.engine.spark-conf.spark.executor.instances=40
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
#
#### Spark conf for specific job
kylin.engine.spark-conf-mergedict.spark.executor.memory=2G
kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
#
## manually upload spark-assembly jar to HDFS and then set this property will avoid repeatedly uploading jar at runtime
kylin.engine.spark-conf.spark.yarn.archive=hdfs://master:9000/kylin/spark/spark-libs.jar
##kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
#
## uncomment for HDP
##kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current
##kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=current
##kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current
#
#
#### QUERY PUSH DOWN ###
#
##kylin.query.pushdown.runner-class-name=org.apache.kylin.query.adhoc.PushDownRunnerJdbcImpl
#
##kylin.query.pushdown.update-enabled=false
##kylin.query.pushdown.jdbc.url=jdbc:hive2://sandbox:10000/default
##kylin.query.pushdown.jdbc.driver=org.apache.hive.jdbc.HiveDriver
##kylin.query.pushdown.jdbc.username=hive
##kylin.query.pushdown.jdbc.password=
#
##kylin.query.pushdown.jdbc.pool-max-total=8
##kylin.query.pushdown.jdbc.pool-max-idle=8
##kylin.query.pushdown.jdbc.pool-min-idle=0
#
#### JDBC Data Source
##kylin.source.jdbc.connection-url=
##kylin.source.jdbc.driver=
##kylin.source.jdbc.dialect=
##kylin.source.jdbc.user=
##kylin.source.jdbc.pass=
##kylin.source.jdbc.sqoop-home=
##kylin.source.jdbc.filed-delimiter=|
kylin.job.jar=$KYLIN_HOME/lib/kylin-job-2.6.3.jar
kylin.coprocessor.local.jar=$KYLIN_HOME/lib/kylin-coprocessor-2.6.3.jar

Check the runtime environment

$ $KYLIN_HOME/bin/check-env.sh

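Troubleshooting

Log directory: $KYLIN_HOME/logs contains kylin.log, kylin.out and the GC logs.

Problem 0: the ZooKeeper address from the hbase configuration is used by default (UnknownHostException)

This is reported as soon as the environment check runs.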

Caused by: java.net.UnknownHostException: 192.168.100.5:12181: invalid IPv6 address

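This means ZooKeeper is configured in hbase as host:port, whereas Kylin reads the host and the port separately. Change the configuration so that host and port are set in separate properties.

Edit the hbase configuration file. Changing only the copy on the Kylin machine is enough; to keep everything consistent, change it on all nodes and restart hbase.

$HBASE_HOME/conf/hbase-site.xml

<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>12181</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>192.168.100.5,192.168.100.6,192.168.100.7</value>
</property>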
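The split from a combined host:port quorum into the two separate values can be sketched in shell. This is only an illustration: the function names zk_hosts and zk_port are hypothetical, the input string is an assumed reconstruction of the old setting based on the error message, and all entries are assumed to share the same port.

```shell
#!/usr/bin/env bash
# Sketch: derive the two hbase-site.xml values from a "host:port,host:port"
# quorum string. zk_hosts and zk_port are hypothetical helper names.
zk_hosts() {
  # Drop the ":port" suffix from every comma-separated entry.
  printf '%s\n' "$1" | sed 's/:[0-9][0-9]*//g'
}

zk_port() {
  # Print the port of the first entry; prints nothing if no port is present.
  printf '%s\n' "$1" | sed -n 's/^[^:,]*:\([0-9][0-9]*\).*$/\1/p'
}

# Assumed old-style value; the error above only shows its first entry.
quorum='192.168.100.5:12181,192.168.100.6:12181,192.168.100.7:12181'
echo "hbase.zookeeper.quorum=$(zk_hosts "$quorum")"
echo "hbase.zookeeper.property.clientPort=$(zk_port "$quorum")"
```

The two echoed values correspond to the two properties in the hbase-site.xml snippet above.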

Once the environment check passes, load the sample data and start the Kylin server

$ $KYLIN_HOME/bin/sample.sh
$ $KYLIN_HOME/bin/kylin.sh start
...
A new Kylin instance is started by admin. To stop it, run 'kylin.sh stop'
Check the log at /home/admin/kylin-2.6.3/logs/kylin.log
Web UI is at http://slave1:7070/kylin

Problem 1: http://192.168.100.2:7070/kylin returns 404

SEVERE: Failed to load keystore type JKS with path conf/.keystore due to /home/admin/kylin-2.6.3/tomcat/conf/.keystore (No such file or directory)
java.io.FileNotFoundException: /home/admin/kylin-2.6.3/tomcat/conf/.keystore (No such file or directory)
SEVERE: Context [/kylin] startup failed due to previous errors
Invalid character found in method name. HTTP method names must be tokens

This is an HTTPS certificate problem: the keystore file does not exist. Comment out the https connector in $KYLIN_HOME/tomcat/conf/server.xml.

<!--
<Connector port="7443" protocol="org.apache.coyote.http11.Http11Protocol"
           maxThreads="150" SSLEnabled="true" scheme="https" secure="true"
           keystoreFile="conf/.keystore" keystorePass="changeit"
           clientAuth="false" sslProtocol="TLS" />
-->

Stop the Kylin server and start it again; do the same after fixing any of the other problems.

$ $KYLIN_HOME/bin/kylin.sh stop
$ $KYLIN_HOME/bin/kylin.sh start

Problem 2: jackson jar conflict

2019-08-13T15:44:18,486 ERROR [localhost-startStop-1] org.springframework.web.context.ContextLoader - Context initialization failed
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter': Instantiation of bean failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter]: Constructor threw exception; nested exception is java.lang.ClassCastException: com.fasterxml.jackson.datatype.joda.JodaModule cannot be cast to com.fasterxml.jackson.databind.Module

Version of the dependency shipped with hive:

/home/admin/hive-2.3.5/lib/jackson-datatype-joda-2.4.6.jar

Version of the dependency shipped with Kylin:

/home/admin/kylin-2.6.3/tomcat/webapps/kylin/WEB-INF/lib/jackson-databind-2.9.5.jar

Remove the hive one [by renaming it].

mv $HIVE_HOME/lib/jackson-datatype-joda-2.4.6.jar $HIVE_HOME/lib/jackson-datatype-joda-2.4.6.jarback
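Listing the jackson jars with their versions makes this kind of mismatch easy to spot before the server fails. A minimal sketch follows; jar_version and list_jackson_versions are hypothetical helper names, and the approach assumes the usual name-version.jar file naming.

```shell
#!/usr/bin/env bash
# Sketch: print the version of every jackson-* jar in a lib directory so that
# mismatches such as 2.4.6 vs 2.9.5 stand out. Helper names are hypothetical.
jar_version() {
  # "jackson-datatype-joda-2.4.6.jar" -> "2.4.6"
  basename "$1" .jar | sed 's/^.*-\([0-9][0-9.]*\)$/\1/'
}

list_jackson_versions() {
  local jar
  for jar in "$1"/jackson-*.jar; do
    [ -e "$jar" ] || continue   # unmatched glob: directory has no jackson jars
    printf '%s %s\n' "$(jar_version "$jar")" "$(basename "$jar")"
  done
}

# Prints nothing when HIVE_HOME is unset or the directory does not exist.
list_jackson_versions "${HIVE_HOME:-/nonexistent}/lib"
```

Running the same function against $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib gives the other side of the comparison.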

Log in with the default ADMIN/KYLIN [all upper case]

The user names and passwords (stored as bcrypt hashes) are defined in $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/classes/kylinSecurity.xml.

<bean class="org.springframework.security.core.userdetails.User" id="adminUser">
    <constructor-arg value="ADMIN"/>
    <constructor-arg
            value="$2a$10$o3ktIWsGYxXNuUWQiYlZXOW5hWcqyNAFQsSSCSEWoC/BRVMAUjL32"/>
    <constructor-arg ref="adminAuthorities"/>
</bean>
<bean class="org.springframework.security.core.userdetails.User" id="modelerUser">
    <constructor-arg value="MODELER"/>
    <constructor-arg
            value="$2a$10$Le5ernTeGNIARwMJsY0WaOLioNQdb0QD11DwjeyNqqNRp5NaDo2FG"/>
    <constructor-arg ref="modelerAuthorities"/>
</bean>
<bean class="org.springframework.security.core.userdetails.User" id="analystUser">
    <constructor-arg value="ANALYST"/>
    <constructor-arg
            value="$2a$10$s4INO3XHjPP5Vm2xH027Ce9QeXWdrfq5pvzuGr9z/lQmHqi0rsbNi"/>
    <constructor-arg ref="analystAuthorities"/>
</bean>

Problem 3: building the sample Cube with MapReduce fails with "10020 failed"

org.apache.kylin.engine.mr.exception.MapReduceException: Exception: java.net.ConnectException: Call From slave1/192.168.100.2 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
java.net.ConnectException: Call From slave1/192.168.100.2 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:173)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
    at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
    at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Port 10020 belongs to the MapReduce JobHistory Server, which has to be started.

Check whether hadoop's mapred-site.xml contains the following configuration:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
    <description>MapReduce JobHistory Server IPC host:port</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
    <description>MapReduce JobHistory Server Web UI host:port</description>
  </property>
</configuration>
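The 0.0.0.0:10020 in the error is Hadoop's default for mapreduce.jobhistory.address, which suggests the property was not set in the configuration the client read. A small sketch for checking what a given mapred-site.xml actually contains follows; hadoop_conf_value is a hypothetical helper (not a Hadoop command) and it assumes the simple name-then-value layout shown above rather than doing real XML parsing.

```shell
#!/usr/bin/env bash
# Sketch: print the <value> that follows a given <name> in a Hadoop *-site.xml.
# hadoop_conf_value is a hypothetical helper; it assumes the <value> tag
# appears on the same line as, or on a line after, the matching <name> tag.
hadoop_conf_value() {
  awk -v key="$2" '
    index($0, "<name>" key "</name>") { found = 1 }
    found && /<value>/ {
      sub(/^.*<value>/, "")
      sub(/<\/value>.*$/, "")
      print
      exit
    }
  ' "$1"
}

conf="${HADOOP_CONF_DIR:-}/mapred-site.xml"
if [ -f "$conf" ]; then
  echo "mapreduce.jobhistory.address=$(hadoop_conf_value "$conf" mapreduce.jobhistory.address)"
fi
```

An empty value in the output means the property is missing from that file and the default applies.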

On the master node:

$ mr-jobhistory-daemon.sh start historyserver

Problem 4: error right after clicking Build when building a Cube with Spark

Caused by: java.lang.NoClassDefFoundError: org/apache/spark/api/java/function/Function
        at org.apache.kylin.engine.spark.SparkBatchCubingJobBuilder2.<init>(SparkBatchCubingJobBuilder2.java:53) ~[kylin-engine-spark-2.6.3.jar:2.6.3]
        at org.apache.kylin.engine.spark.SparkBatchCubingEngine2.createBatchCubingJob(SparkBatchCubingEngine2.java:44) ~[kylin-engine-spark-2.6.3.jar:2.6.3]
        at org.apache.kylin.engine.EngineFactory.createBatchCubingJob(EngineFactory.java:60) ~[kylin-core-job-2.6.3.jar:2.6.3]
        at org.apache.kylin.rest.service.JobService.submitJobInternal(JobService.java:234) ~[kylin-server-base-2.6.3.jar:2.6.3]
        at org.apache.kylin.rest.service.JobService.submitJob(JobService.java:202) ~[kylin-server-base-2.6.3.jar:2.6.3]
        at org.apache.kylin.rest.controller.CubeController.buildInternal(CubeController.java:395) ~[kylin-server-base-2.6.3.jar:2.6.3]
        ... 77 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.api.java.function.Function
        at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1928) ~[catalina.jar:7.0.91]
        at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1771) ~[catalina.jar:7.0.91]
        at org.apache.kylin.engine.spark.SparkBatchCubingJobBuilder2.<init>(SparkBatchCubingJobBuilder2.java:53) ~[kylin-engine-spark-2.6.3.jar:2.6.3]
        at org.apache.kylin.engine.spark.SparkBatchCubingEngine2.createBatchCubingJob(SparkBatchCubingEngine2.java:44) ~[kylin-engine-spark-2.6.3.jar:2.6.3]
        at org.apache.kylin.engine.EngineFactory.createBatchCubingJob(EngineFactory.java:60) ~[kylin-core-job-2.6.3.jar:2.6.3]
        at org.apache.kylin.rest.service.JobService.submitJobInternal(JobService.java:234) ~[kylin-server-base-2.6.3.jar:2.6.3]
        at org.apache.kylin.rest.service.JobService.submitJob(JobService.java:202) ~[kylin-server-base-2.6.3.jar:2.6.3]
        at org.apache.kylin.rest.controller.CubeController.buildInternal(CubeController.java:395) ~[kylin-server-base-2.6.3.jar:2.6.3]

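The engine-spark module in the Kylin source declares several dependencies as provided in its pom.xml, but they are not found on the CLASSPATH once the Kylin server has started. The simple fix is to copy the missing jars straight into $KYLIN_HOME/tomcat/lib.

kylin-2.5.2/tomcat/lib/spark-core_2.11-2.1.2.jar (Caused by: java.lang.ClassNotFoundException: org.apache.spark.api.java.function.Function)
kylin-2.5.2/tomcat/lib/scala-library-2.11.8.jar (Caused by: java.lang.ClassNotFoundException: scala.Serializable)

These file names come from the Kylin 2.5.2 environment with its bundled Spark 2.1.2. With the Spark 2.3.2 download used here, the core jar is named spark-core_2.11-2.3.2.jar, so adjust the version in the commands accordingly.

$ cp $KYLIN_HOME/spark/jars/spark-core_2.11-2.1.2.jar $KYLIN_HOME/tomcat/lib
$ cp $KYLIN_HOME/spark/jars/scala-library-2.11.8.jar $KYLIN_HOME/tomcat/lib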
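Because the jar file names carry the Spark and Scala versions, a copy by name prefix avoids hard-coding them. The following is a sketch under that assumption; copy_jars_by_prefix is a hypothetical helper, and the prefixes spark-core_2.11 and scala-library are taken from the file names above.

```shell
#!/usr/bin/env bash
# Sketch: copy jars by name prefix so the command does not depend on the exact
# Spark/Scala version. copy_jars_by_prefix is a hypothetical helper name.
copy_jars_by_prefix() {
  local src="$1" dest="$2" prefix jar
  shift 2
  for prefix in "$@"; do
    # Match "<prefix>-<version>.jar", where the version starts with a digit.
    for jar in "$src/$prefix"-[0-9]*.jar; do
      if [ -e "$jar" ]; then
        cp "$jar" "$dest"/ && echo "copied $(basename "$jar")"
      else
        echo "no jar found for prefix $prefix in $src" >&2
      fi
    done
  done
}

# Only runs when the Spark download is actually present.
if [ -d "${KYLIN_HOME:-}/spark/jars" ]; then
  copy_jars_by_prefix "$KYLIN_HOME/spark/jars" "$KYLIN_HOME/tomcat/lib" \
    spark-core_2.11 scala-library
fi
```

A missing prefix is reported on stderr instead of aborting, so one absent jar does not hide the others.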

Restart the Kylin server for the change to take effect.

Problem 5: the second step of the Spark Cube build cannot find HiveConf

Configure HBASE_CLASSPATH_PREFIX in $KYLIN_HOME/bin/kylin.sh.

org.apache.kylin.job.exception.ExecuteException: org.apache.kylin.job.exception.ExecuteException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
Caused by: org.apache.kylin.job.exception.ExecuteException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

The HBASE_CLASSPATH_PREFIX setting uses tomcat_root and $hive_dependency from the environment variables.

export HBASE_CLASSPATH_PREFIX=${tomcat_root}/bin/bootstrap.jar:${tomcat_root}/bin/tomcat-juli.jar:${tomcat_root}/lib/*:$hive_dependency:$HBASE_CLASSPATH_PREFIX

Problem 6: the third step of the Spark Cube build fails with "The auxService:spark_shuffle does not exist"

Log location:

/home/admin/hadoop-2.8.5/logs/userlogs/application_1565839225073_0008/container_1565839225073_0008_01_000001

19/08/15 16:52:06 ERROR yarn.YarnAllocator: Failed to launch executor 87 on container container_1565839225073_0008_01_000088
org.apache.spark.SparkException: Exception while starting container container_1565839225073_0008_01_000088 on host slave2
        at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:126)
        at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:66)
        at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:520)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
        at sun.reflect.GeneratedConstructorAccessor24.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:205)
        at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:123)
        ... 5 more
19/08/15 16:52:07 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/08/15 16:52:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (80) reached)
19/08/15 16:52:09 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: error execute org.apache.kylin.engine.spark.SparkFactDistinct. Root cause: null
java.lang.RuntimeException: error execute org.apache.kylin.engine.spark.SparkFactDistinct. Root cause: null
        at org.apache.kylin.common.util.AbstractApplication.execute(AbstractApplication.java:42)
        at org.apache.kylin.common.util.SparkEntry.main(SparkEntry.java:44)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:636)
Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1928)

Modify hadoop's yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle,mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
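When a cluster already lists other auxiliary services, spark_shuffle has to be added to the existing comma-separated value rather than replace it. A small sketch of that merge follows; add_aux_service is a hypothetical helper that only manipulates the string and does not edit yarn-site.xml itself.

```shell
#!/usr/bin/env bash
# Sketch: append a service to a comma-separated yarn.nodemanager.aux-services
# value unless it is already listed. add_aux_service is a hypothetical helper.
add_aux_service() {
  local current="$1" service="$2"
  case ",$current," in
    *",$service,"*) printf '%s\n' "$current" ;;           # already present
    ",,")           printf '%s\n' "$service" ;;           # list was empty
    *)              printf '%s\n' "$current,$service" ;;  # append
  esac
}

add_aux_service "mapreduce_shuffle" "spark_shuffle"
```

Wrapping the value in commas before matching keeps a service such as spark_shuffle2 from being mistaken for spark_shuffle.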

The jar that contains YarnShuffleService must be added as well; otherwise the error is: Class org.apache.spark.network.yarn.YarnShuffleService not found.

Download the Spark source and build it.

$ wget https://github.com/apache/spark/archive/v2.3.2.tar.gz
$ tar zxvf v2.3.2.tar.gz
$ cd spark-2.3.2/
$ ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package

The jar needed is spark-2.3.2-yarn-shuffle.jar, located in the spark-2.3.2/common/network-yarn/target/scala-2.11 directory.

Distribute it to /home/admin/hadoop-2.8.5/share/hadoop/yarn/lib on the NodeManager nodes and restart the YARN cluster.
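The distribution step can be scripted. The sketch below only prints the scp commands instead of running them, so they can be reviewed first; print_distribute_commands is a hypothetical helper, and the host list master, slave1, slave2 is an assumption based on the host names that appear in the logs above.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the scp commands that would copy the shuffle jar to
# each NodeManager host. The helper name and the host list are assumptions.
print_distribute_commands() {
  local jar="$1" dest="$2" host
  shift 2
  for host in "$@"; do
    printf 'scp %s %s:%s/\n' "$jar" "$host" "$dest"
  done
}

print_distribute_commands \
  spark-2.3.2/common/network-yarn/target/scala-2.11/spark-2.3.2-yarn-shuffle.jar \
  /home/admin/hadoop-2.8.5/share/hadoop/yarn/lib \
  master slave1 slave2
```

Piping the output to sh would execute the copies once the list looks right.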

The steps above follow the official Spark documentation at http://spark.apache.org/docs/2.3.2/running-on-yarn.html

Configuring the External Shuffle Service
To start the Spark Shuffle Service on each NodeManager in your YARN cluster, follow these instructions:
1. Build Spark with the YARN profile. Skip this step if you are using a pre-packaged distribution.
2. Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/common/network-yarn/target/scala-<version> if you are building Spark yourself, and under yarn if you are using a distribution.
3. Add this jar to the classpath of all NodeManagers in your cluster.
4. In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService.
5. Increase NodeManager's heap size by setting YARN_HEAPSIZE (1000 by default) in etc/hadoop/yarn-env.sh to avoid garbage collection issues during shuffle.
6. Restart all NodeManagers in your cluster.

Screenshots of the run [Spark 2.1.2 is the Spark version bundled with Kylin 2.5.2; the problems described above are the same on Kylin 2.6.3 and Kylin 2.5.2]

[Screenshots: Building Cubes with Spark in Kylin]

Author: 知了小巷 https://blog.icocoro.me/2019/08/16/1908-kylin-cube-build/

This article was shared from the WeChat official account 大数据技术与架构 (import_bigdata).