Compiling the Source Code
Downloading the Spark Source Code
For this walkthrough we use the Apache release, version spark-2.4.5.
Download link: https://github.com/apache/spark
Update 2020-04-21:
It is generally best to compile Spark with the Scala version it targets. If you compile with a different Scala version, you need to change the Scala properties in the root pom.xml:
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.12</scala.binary.version>
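Instead of editing these properties by hand, the Spark 2.4 source tree also ships a helper script that rewrites the Scala version across all pom files. A minimal sketch, assuming you run it from the source root and your branch still carries the script at this path:

# rewrite every pom.xml in the tree for Scala 2.12
./dev/change-scala-version.sh 2.12
# then enable the matching profile when building
./build/mvn -Pscala-2.12 -DskipTests clean package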
Compiling the Spark Source Code

Before compiling the Spark source, a few changes are needed: several dependencies are declared with scope provided, which leads to a ClassNotFoundException when running from the IDE.
Edit the pom.xml of the hive-thriftserver module:
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-server</artifactId>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlet</artifactId>
<!-- <scope>provided</scope>-->
</dependency>

Edit the root pom.xml:
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-http</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-continuation</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlet</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlets</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-proxy</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-client</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-util</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-security</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-plus</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-server</artifactId>
<version>${jetty.version}</version>
<!-- <scope>provided</scope>-->
</dependency>

If you run into other similar ClassNotFoundExceptions, they have the same cause; comment out the provided scope in the same way.
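To locate any jetty artifacts that are still marked provided, one option is to print the module's dependency tree with Maven. A quick sketch (the module path sql/hive-thriftserver follows the upstream source layout):

# list the jetty dependencies of the thriftserver module together with their scopes
./build/mvn -pl sql/hive-thriftserver dependency:tree -Dincludes=org.eclipse.jetty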
Build from git-bash:
mvn clean package -DskipTests=true
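If the build runs out of memory, or you need the Hive and Thrift-server modules in the result, a fuller invocation along the lines of the upstream build docs may help (the heap values are suggestions, not hard requirements):

# give Maven enough heap and code cache, as recommended by the Spark build docs
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
# build with the Hive and Thrift-server profiles enabled
./build/mvn -Phive -Phive-thriftserver -DskipTests clean package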
After a long wait, output like the following indicates that the build succeeded (I did not keep the Spark output, so a Hive build log is shown here as an example):
[INFO] Reactor Summary:
[INFO]
[INFO] Hive 1.1.0-cdh5.16.2 ............................... SUCCESS [ 3.119 s]
[INFO] Hive Classifications ............................... SUCCESS [ 2.406 s]
[INFO] Hive Shims Common .................................. SUCCESS [ 3.327 s]
[INFO] Hive Shims 0.23 .................................... SUCCESS [ 3.494 s]
[INFO] Hive Shims Scheduler ............................... SUCCESS [ 2.423 s]
[INFO] Hive Shims ......................................... SUCCESS [ 1.463 s]
[INFO] Hive Common ........................................ SUCCESS [ 8.382 s]
[INFO] Hive Serde ......................................... SUCCESS [ 8.001 s]
[INFO] Hive Metastore ..................................... SUCCESS [ 28.285 s]
[INFO] Hive Ant Utilities ................................. SUCCESS [ 1.668 s]
[INFO] Spark Remote Client ................................ SUCCESS [ 4.915 s]
[INFO] Hive Query Language ................................ SUCCESS [01:36 min]
[INFO] Hive Service ....................................... SUCCESS [ 22.921 s]
[INFO] Hive Accumulo Handler .............................. SUCCESS [ 5.496 s]
[INFO] Hive JDBC .......................................... SUCCESS [ 5.797 s]
[INFO] Hive Beeline ....................................... SUCCESS [ 3.957 s]
[INFO] Hive CLI ........................................... SUCCESS [ 4.060 s]
[INFO] Hive Contrib ....................................... SUCCESS [ 4.321 s]
[INFO] Hive HBase Handler ................................. SUCCESS [ 5.518 s]
[INFO] Hive HCatalog ...................................... SUCCESS [ 1.399 s]
[INFO] Hive HCatalog Core ................................. SUCCESS [ 5.933 s]
[INFO] Hive HCatalog Pig Adapter .......................... SUCCESS [ 4.632 s]
[INFO] Hive HCatalog Server Extensions .................... SUCCESS [ 4.477 s]
[INFO] Hive HCatalog Webhcat Java Client .................. SUCCESS [ 4.903 s]
[INFO] Hive HCatalog Webhcat .............................. SUCCESS [ 7.452 s]
[INFO] Hive HCatalog Streaming ............................ SUCCESS [ 4.306 s]
[INFO] Hive HWI ........................................... SUCCESS [ 3.461 s]
[INFO] Hive ODBC .......................................... SUCCESS [ 3.061 s]
[INFO] Hive Shims Aggregator .............................. SUCCESS [ 0.840 s]
[INFO] Hive TestUtils ..................................... SUCCESS [ 1.077 s]
[INFO] Hive Packaging 1.1.0-cdh5.16.2 ..................... SUCCESS [ 4.194 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:22 min
[INFO] Finished at: 2020-04-12T18:50:46+08:00
[INFO] ------------------------------------------------------------------------

Importing the Source into IDEA
Import the source into IDEA as a Maven project and wait for the dependency resolution to finish.
Before building, delete the streaming package under the test sources of spark-sql; otherwise Build Project will descend into it and fail with java.lang.OutOfMemoryError: GC overhead limit exceeded. Then click Build Project to compile.
Debugging Spark SQL Locally
In the hive-thriftserver module, create a resources directory under main and mark it as a resources root. Then copy the following configuration file from the cluster into the resources directory:

hive-site.xml
<configuration>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop:9083</value>
<description>Points to the host running the metastore service</description>
</property>
</configuration>

The metastore service must be started on the server:
hive --service metastore &
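Before continuing, it is worth confirming that the metastore is actually listening on the port configured in hive.metastore.uris. A quick sketch (9083 matches the hive-site.xml above):

# the metastore should be bound to port 9083
netstat -tlnp | grep 9083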
Run SparkSQLCLIDriver. Before running, add the following parameters to VM options:

-Dspark.master=local[2] -Djline.WindowsTerminal.directConsole=false
The console prints output like the following:
Spark master: local[2], Application Id: local-1587372819248
spark-sql (default)> show databases;
show databases;
databaseName
access_dw
default
offline_dw
store_format
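As a final sanity check outside the IDE, the same CLI can also be launched from a built Spark distribution; it picks up the same hive-site.xml when the file is placed under conf/. A sketch, assuming the build above succeeded and the metastore is still running:

# start the Spark SQL CLI locally against the remote metastore
./bin/spark-sql --master local[2]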