Deploy Spark on a cluster/cloud


1. Deploying Spark in the cloud

Prerequisite: Hadoop 2.7 has already been set up.

1.1 Download Spark, extract it, and set its paths

1.1.1 Choosing a Spark version

In general, pick the latest version. Choose the package "pre-built with user-provided Hadoop"; do not download the source release.
Also choose the Scala version that matches the Spark build.
The installation order so far is: Linux first, then the JDK, then Hadoop. The open question is whether Scala has to be installed separately or whether Spark already bundles it; the plan is to install Spark first and then test whether Scala works.

1.1.2 Downloading and extracting Spark

Run the following commands to install it:

wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
sudo tar -zxf ~/spark-2.3.1-bin-without-hadoop.tgz -C /usr/local/
cd /usr/local
sudo mv ./spark-2.3.1-bin-without-hadoop/ ./spark
sudo chown -R root:root ./spark

1.1.3 Setting the Spark path / environment variables

Run the following commands in a terminal on the Master node:

vim ~/.bashrc                 # add the two export lines below to ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
source ~/.bashrc              # reload the shell configuration
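
A quick sanity check after reloading the shell, assuming the two export lines above were indeed saved into ~/.bashrc (the expected outputs are what this layout should produce, not something verified in the original notes):

echo $SPARK_HOME          # should print /usr/local/spark
which spark-submit        # should print /usr/local/spark/bin/spark-submit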

1.2 Spark configuration

1.2.1 Configuring the slaves file

Perform the following on the Master node:
Copy slaves.template to slaves:

cd /usr/local/spark/
cp ./conf/slaves.template ./conf/slaves

The slaves file lists the Worker nodes. Edit it and replace the default localhost entry with the following:

vim ./conf/slaves
slave01
slave02
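
The hostnames master, slave01, and slave02 must be resolvable from every node, via DNS or /etc/hosts. A minimal /etc/hosts sketch; only the master address 10.25.96.3 comes from this setup, the worker addresses below are placeholders:

sudo vim /etc/hosts
10.25.96.3   master      # internal IP of the Master node (from this setup)
10.25.96.4   slave01     # placeholder, replace with the real internal IP
10.25.96.5   slave02     # placeholder, replace with the real internal IP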

1.2.2 Configuring the spark-env.sh file

Copy spark-env.sh.template to spark-env.sh:

cp ./conf/spark-env.sh.template ./conf/spark-env.sh

Edit spark-env.sh and add the following:

vim ./conf/spark-env.sh
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)   # the "without-hadoop" build needs Hadoop's jars on its classpath
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_MASTER_IP=10.25.96.3                                       # internal IP of the Master node

Note: this part of the Spark configuration depends on the Hadoop installation path and on the Master node's IP address.
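
spark-env.sh can also cap the resources each Worker offers to executors. These settings are optional and not part of the configuration above; the values below are purely illustrative (SPARK_WORKER_CORES and SPARK_WORKER_MEMORY are standard standalone-mode variables):

# optional additions to ./conf/spark-env.sh (illustrative values)
export SPARK_WORKER_CORES=2        # CPU cores each Worker offers
export SPARK_WORKER_MEMORY=4g      # memory each Worker offers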

1.3 Archive, transfer, and extract

After the configuration is done, copy the /usr/local/spark directory on the Master node to every worker node. On the Master node, run:

cd /usr/local/
tar -zcf ~/spark.master.tar.gz ./spark
cd ~
scp ./spark.master.tar.gz slave01:/root
scp ./spark.master.tar.gz slave02:/root

On slave01 and slave02, run the same commands:

sudo rm -rf /usr/local/spark/
sudo tar -zxf ~/spark.master.tar.gz -C /usr/local
sudo chown -R root /usr/local/spark
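
With more worker nodes, the copy-and-extract steps above can be combined into a single loop run from the Master; a sketch, assuming the same password-less root SSH used for the scp commands above:

for node in slave01 slave02; do
    scp ~/spark.master.tar.gz ${node}:/root                                                              # push the archive
    ssh ${node} "rm -rf /usr/local/spark/ && tar -zxf ~/spark.master.tar.gz -C /usr/local && chown -R root /usr/local/spark"   # extract remotely
done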

1.4 Starting the Spark cluster

Start the Hadoop and Spark clusters:

cd /usr/local/hadoop/
sbin/start-all.sh                      # start HDFS and YARN
/usr/local/spark/sbin/start-all.sh     # start the Spark Master and Workers
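
A quick way to confirm that everything came up is jps on each node; roughly what should appear, assuming the Hadoop daemons from the earlier setup plus the Spark standalone daemons:

jps    # on master: NameNode, SecondaryNameNode, ResourceManager, Master
jps    # on each slave: DataNode, NodeManager, Worker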

View the standalone cluster manager's cluster information in a browser (the Spark Master web UI listens on port 8080 by default, e.g. http://10.25.96.3:8080).
Run SparkPi:

/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.25.96.3:7077 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
or
/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
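
When submitted to the standalone master in the default client deploy mode, SparkPi prints its result on the driver's stdout, so it can be filtered out of the log noise; a small sketch:

/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.25.96.3:7077 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar 2>/dev/null | grep "Pi is roughly"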

Enter the spark-shell:
./bin/spark-shell
Experiments showed that Scala still has to be installed separately; otherwise the Spark shell cannot be entered.
Save the configured Hadoop and Spark packages for reuse.
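
To see which Scala release a separate install should match, the Spark build itself reports the Scala version it was compiled against; a quick check (the banner wording may differ slightly between Spark versions):

/usr/local/spark/bin/spark-submit --version     # the banner includes a "Using Scala version 2.11.x" line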

2. Installing Scala

Scala should only need to be installed on the Master node, because the worker nodes just run a JAR package.

wget https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz

2.1 Scala requires a matching JDK version

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | sudo /usr/bin/debconf-set-selections
sudo update-java-alternatives -s java-8-oracle
vim ~/.bashrc                 # add the export line below to ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
source ~/.bashrc
java -version
javac -version
$JAVA_HOME/bin/java -version
The PPA installs the JDK under /usr/lib/jvm/.
When configuring paths, Scala, Hadoop, and Spark do not need the JDK path explicitly, but Scala depends on the JDK and has requirements on its version.

2.2 Installing Scala: extract and update the path

sudo tar -zxf ~/下载/scala-2.11.12.tgz -C /usr/local   # extract into /usr/local
cd /usr/local/
sudo mv ./scala-2.11.12/ ./scala                        # rename the directory to scala
sudo chown -R root ./scala
vim ~/.bashrc                                           # add the export line below to ~/.bashrc
export PATH=$PATH:/usr/local/scala/bin
source ~/.bashrc
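
After reloading the shell, a quick check that the right Scala ended up on the PATH:

which scala       # should print /usr/local/scala/bin/scala
scala -version    # should report Scala 2.11.12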

3. Running a Spark application on the cluster/cloud

3.1 Standalone cluster manager

Run the application JAR on the cluster:

/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
or
/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar

3.2 Hadoop YARN cluster manager

Run the application JAR on the cluster:

/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
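
Note that --master yarn-cluster is a deprecated spelling in Spark 2.x; the equivalent, preferred form separates the master from the deploy mode:

/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar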

When running on YARN, the results have to be viewed through the web UI.
Modify the yarn-site.xml configuration:

vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>0.0.0.0:8088</value>
    </property>
</configuration>
Note: the value of yarn.resourcemanager.hostname is master and must not be changed. What is added here is the yarn.resourcemanager.webapp.address property with the value 0.0.0.0:8088.
Restart the ResourceManager and the NodeManager:
/usr/local/hadoop/sbin/yarn-daemon.sh stop resourcemanager
/usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager
/usr/local/hadoop/sbin/yarn-daemon.sh stop nodemanager
/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager

Restart Hadoop and Spark, then resubmit the SparkPi JAR:

/usr/local/spark/sbin/stop-all.sh
/usr/local/hadoop/sbin/stop-all.sh
/usr/local/hadoop/sbin/start-all.sh
/usr/local/spark/sbin/start-all.sh
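
In yarn-cluster mode the SparkPi output ends up in the driver container's log rather than on the submitting terminal, so besides the ResourceManager web UI it can be pulled back with the yarn CLI; a sketch, assuming YARN log aggregation is enabled and <application_id> is taken from the submission output or the application list:

/usr/local/hadoop/bin/yarn application -list                                        # find the application id
/usr/local/hadoop/bin/yarn logs -applicationId <application_id> | grep "Pi is roughly"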

On the SEU cluster, run the application JAR:

cd /opt/spark-2.3.0/
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 examples/jars/spark-examples_2.11-2.3.0.jar
or
/opt/spark-2.3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster /opt/spark-2.3.0/examples/jars/spark-examples_2.11-2.3.0.jar
This reported an error.

4. Partial reinstallation

4.1 Reconfigure the internal IP addresses
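
When the internal IPs change, the addresses baked into this setup have to be found and updated; a rough sketch of where to look, assuming the layout used above (the list is not necessarily complete):

grep -rn "10.25.96" /usr/local/spark/conf /usr/local/hadoop/etc/hadoop /etc/hosts   # find hard-coded internal IPs
vim /usr/local/spark/conf/spark-env.sh                                              # update SPARK_MASTER_IP
vim /etc/hosts                                                                      # update the master/slave host entries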

4.2 Resend the pre-configured Hadoop and Spark archives