Loading the Spark Source Code in IDEA and Querying SQL from the Console

Compiling the Source Code

  1. Download the Spark source code

    For this walkthrough we use the Apache release, version spark-2.4.5.

    Download link: https://github.com/apache/spark

    Update 2020-04-21:

    It is best to build with the Scala version that matches your Spark release. If you build with a different Scala version, update these properties in the root pom.xml:

    <scala.version>2.12.10</scala.version>
    <scala.binary.version>2.12</scala.binary.version>
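
    Rather than editing these properties by hand, the source tree also ships a helper script that rewrites the Scala version across every pom file; a sketch, assuming the spark-2.4.5 source layout:

    # Run from the root of the Spark source tree
    ./dev/change-scala-version.sh 2.12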
  2. Compile the Spark source code

    Before compiling, a few pom changes are needed: several Jetty dependencies are scoped as provided, which causes ClassNotFoundException when the code is run from the IDE.

    • Edit the pom.xml under the hive-thriftserver module

      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-server</artifactId>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-servlet</artifactId>
        <!-- <scope>provided</scope> -->
      </dependency>
    • Edit the root pom.xml

      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-http</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-continuation</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-servlet</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-servlets</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-proxy</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-client</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-util</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-security</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-plus</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>
      <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-server</artifactId>
        <version>${jetty.version}</version>
        <!-- <scope>provided</scope> -->
      </dependency>

      Any other similar ClassNotFoundException has the same cause; comment out the provided scope for the offending artifact in the same way.
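
    Before kicking off the build, it can also help to raise Maven's memory limits; the Spark build documentation suggests settings along these lines (adjust to your machine):

    export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"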

    Compile with Git Bash:

    mvn clean package -DskipTests=true

    ## After a long wait, output like the following indicates a successful build (the Spark log was not saved, so a Hive build log is shown as a stand-in)
    [INFO] Reactor Summary:
    [INFO]
    [INFO] Hive 1.1.0-cdh5.16.2 ............................... SUCCESS [ 3.119 s]
    [INFO] Hive Classifications ............................... SUCCESS [ 2.406 s]
    [INFO] Hive Shims Common .................................. SUCCESS [ 3.327 s]
    [INFO] Hive Shims 0.23 .................................... SUCCESS [ 3.494 s]
    [INFO] Hive Shims Scheduler ............................... SUCCESS [ 2.423 s]
    [INFO] Hive Shims ......................................... SUCCESS [ 1.463 s]
    [INFO] Hive Common ........................................ SUCCESS [ 8.382 s]
    [INFO] Hive Serde ......................................... SUCCESS [ 8.001 s]
    [INFO] Hive Metastore ..................................... SUCCESS [ 28.285 s]
    [INFO] Hive Ant Utilities ................................. SUCCESS [ 1.668 s]
    [INFO] Spark Remote Client ................................ SUCCESS [ 4.915 s]
    [INFO] Hive Query Language ................................ SUCCESS [01:36 min]
    [INFO] Hive Service ....................................... SUCCESS [ 22.921 s]
    [INFO] Hive Accumulo Handler .............................. SUCCESS [ 5.496 s]
    [INFO] Hive JDBC .......................................... SUCCESS [ 5.797 s]
    [INFO] Hive Beeline ....................................... SUCCESS [ 3.957 s]
    [INFO] Hive CLI ........................................... SUCCESS [ 4.060 s]
    [INFO] Hive Contrib ....................................... SUCCESS [ 4.321 s]
    [INFO] Hive HBase Handler ................................. SUCCESS [ 5.518 s]
    [INFO] Hive HCatalog ...................................... SUCCESS [ 1.399 s]
    [INFO] Hive HCatalog Core ................................. SUCCESS [ 5.933 s]
    [INFO] Hive HCatalog Pig Adapter .......................... SUCCESS [ 4.632 s]
    [INFO] Hive HCatalog Server Extensions .................... SUCCESS [ 4.477 s]
    [INFO] Hive HCatalog Webhcat Java Client .................. SUCCESS [ 4.903 s]
    [INFO] Hive HCatalog Webhcat .............................. SUCCESS [ 7.452 s]
    [INFO] Hive HCatalog Streaming ............................ SUCCESS [ 4.306 s]
    [INFO] Hive HWI ........................................... SUCCESS [ 3.461 s]
    [INFO] Hive ODBC .......................................... SUCCESS [ 3.061 s]
    [INFO] Hive Shims Aggregator .............................. SUCCESS [ 0.840 s]
    [INFO] Hive TestUtils ..................................... SUCCESS [ 1.077 s]
    [INFO] Hive Packaging 1.1.0-cdh5.16.2 ..................... SUCCESS [ 4.194 s]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 04:22 min
    [INFO] Finished at: 2020-04-12T18:50:46+08:00
    [INFO] ------------------------------------------------------------------------
  3. Import the source into IDEA

    Import the source into IDEA as a Maven project, then wait for the dependencies to finish loading.

    Before building, delete the streaming package under the test sources of the spark-sql module; otherwise Build Project descends into it and fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.

    Click Build Project to compile.

Debugging Spark SQL Locally

  1. In the hive-thriftserver module, create a resources directory under main and mark it as a resources root

  2. Copy the following configuration file from the cluster into the resources directory

    hive-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
      </property>
      <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
      </property>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop:9083</value>
        <description>Points to the host running the metastore service</description>
      </property>
    </configuration>
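
    Note that the host name in hive.metastore.uris (hadoop here) must resolve from the machine running IDEA; a hypothetical /etc/hosts entry (the address below is made up for illustration):

    # Map the metastore host name to the cluster node's address (example IP)
    192.168.1.100  hadoop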

    The metastore service must be running on the server:

    hive --service metastore &
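
    To confirm the metastore is up, check that something is listening on the port from hive.metastore.uris (9083 by default):

    netstat -nltp | grep 9083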
  3. Run SparkSQLCLIDriver

    Before running, add the following to VM options. -Dspark.master=local[2] makes Spark run locally with two worker threads; -Djline.WindowsTerminal.directConsole=false works around a jline console issue when running on Windows inside the IDE:

    -Dspark.master=local[2] -Djline.WindowsTerminal.directConsole=false

    The console then prints output like the following:

    Spark master: local[2], Application Id: local-1587372819248
    spark-sql (default)> show databases;
    show databases;
    databaseName
    access_dw
    default
    offline_dw
    store_format
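
    The same metastore connection can also be exercised programmatically instead of through the CLI; a minimal sketch (the object name is made up), assuming hive-site.xml is on the classpath and the Spark and Hive dependencies resolve:

    import org.apache.spark.sql.SparkSession

    // Minimal smoke test: list the databases visible through the Hive metastore.
    object MetastoreSmokeTest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]")      // same effect as -Dspark.master=local[2]
          .appName("MetastoreSmokeTest")
          .enableHiveSupport()     // picks up hive-site.xml from the classpath
          .getOrCreate()

        spark.sql("show databases").show()  // should list access_dw, default, ...
        spark.stop()
      }
    }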