1. Deploying Spark in the Cloud
The prerequisite is that Hadoop 2.7 has already been set up.
1.1 Download, Extract, and Set the Spark Path
1.1.1 Choosing the Spark version
Generally pick the latest release. Choose the package pre-built with user-provided Hadoop; do not download the source distribution.
Choose the matching Scala version.
The install order is: Linux first, then the JDK, then Hadoop. The question then is whether Scala has to be installed separately or whether Spark already ships with it. Unclear at this point; install Spark first and then try Scala to find out.
1.1.2 Downloading and extracting Spark
Run the following commands to download and install Spark:
wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
sudo tar -zxf ~/spark-2.3.1-bin-without-hadoop.tgz -C /usr/local/
cd /usr/local
sudo mv ./spark-2.3.1-bin-without-hadoop/ ./spark
sudo chown -R root:root ./spark
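A quick optional sanity check that the extraction and ownership change worked:
ls /usr/local/spark        # should list bin, sbin, conf, examples, jars, ...
ls -ld /usr/local/spark    # owner and group should now be root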
1.1.3 Setting the Spark path and environment variables
Run the following commands in a terminal on the Master node:
vim ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
source ~/.bashrc
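To confirm the new environment variables took effect, something like the following can be run in a fresh terminal:
echo $SPARK_HOME       # expected: /usr/local/spark
which spark-submit     # expected: /usr/local/spark/bin/spark-submit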
1.2 Configuring Spark Parameters
1.2.1 Configuring the slaves file
Perform the following steps on the Master node.
Copy slaves.template to slaves:
cd /usr/local/spark/
cp ./conf/slaves.template ./conf/slaves
The slaves file lists the Worker nodes. Edit it and replace the default localhost entry with the following:
vim ./conf/slaves
slave01
slave02
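Spark's start-all.sh later starts the Workers over SSH, so passwordless SSH from the Master to each host listed in slaves is assumed (it normally already exists from the Hadoop setup); a quick check:
ssh slave01 hostname    # should print slave01 without prompting for a password
ssh slave02 hostname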
1.2.2 Configuring spark-env.sh
Copy spark-env.sh.template to spark-env.sh:
cp ./conf/spark-env.sh.template ./conf/spark-env.sh
Edit spark-env.sh and add the following:
vim ./conf/spark-env.sh
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_MASTER_IP=10.25.96.3
Note: this Spark configuration depends on the Hadoop installation path and on the Master node's IP address.
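(In Spark 2.x, SPARK_MASTER_HOST is the newer name for SPARK_MASTER_IP, although the old name still works.) Because the without-hadoop build depends entirely on SPARK_DIST_CLASSPATH, it is worth checking that the hadoop classpath command resolves before going further:
/usr/local/hadoop/bin/hadoop classpath    # should print a long list of Hadoop jar/conf paths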
1.3 Compress, Transfer, and Extract
After the configuration is done, copy the /usr/local/spark folder on the Master host to every worker node. Run the following commands on the Master host:
cd /usr/local/
tar -zcf ~/spark.master.tar.gz ./spark
cd ~
scp ./spark.master.tar.gz slave01:/root
scp ./spark.master.tar.gz slave02:/root
On slave01 and slave02, run the same steps below:
sudo rm -rf /usr/local/spark/
sudo tar -zxf ~/spark.master.tar.gz -C /usr/local
sudo chown -R root /usr/local/spark
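A quick check from the Master that the configuration arrived intact on each worker:
ssh slave01 cat /usr/local/spark/conf/spark-env.sh
ssh slave02 cat /usr/local/spark/conf/spark-env.sh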
1.4 Starting the Spark Cluster
Start the Hadoop cluster and then the Spark cluster:
cd /usr/local/hadoop/
sbin/start-all.sh
/usr/local/spark/sbin/start-all.sh
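Once both clusters are up, jps gives a quick view of the daemons running on each node:
jps
# Master node: typically NameNode, SecondaryNameNode, ResourceManager, Master
# slave01/slave02: typically DataNode, NodeManager, Worker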
Check the Spark standalone cluster manager's cluster information in a browser (the Master web UI defaults to port 8080, e.g. http://10.25.96.3:8080).
Run SparkPi:
/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.25.96.3:7077 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
or
/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
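In standalone (client) mode SparkPi prints its result straight to the driver's stdout, so it can be filtered out of the log noise; with yarn-cluster the driver runs inside YARN and the line ends up in the application logs instead (see 3.2). For example:
/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.25.96.3:7077 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar 2>&1 | grep "Pi is roughly"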
Enter the Spark shell: ./bin/spark-shell
In practice, it turns out Scala still has to be installed; otherwise the Spark shell cannot be entered.
Save the Hadoop and Spark packages for reuse.
2. Installing Scala
Scala should only need to be installed on the master node, since the worker nodes just run a JAR package:
wget https://downloads.lightbend.com/scala/2.11.12/scala-2.11.12.tgz
2.1 Scala needs a matching JDK version
sudo add-apt-repository ppa:webupd8team/java
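That PPA historically provided the Oracle JDK installer packages; the install presumably continues along these lines (package name assumed, and the PPA has since been discontinued):
sudo apt-get update
sudo apt-get install oracle-java8-installer    # assumed package name from this PPA
java -version                                  # verify the JDK before installing Scala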
2.2 Installing Scala: extract and update the path
sudo tar -zxf ~/下载/scala-2.11.12.tgz -C /usr/local    # extract into /usr/local
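The heading mentions updating the path, but no commands are shown; a minimal sketch, assuming the same /usr/local layout used for Spark:
sudo mv /usr/local/scala-2.11.12 /usr/local/scala    # assumed target directory name
vim ~/.bashrc
# add the two lines below, then run: source ~/.bashrc
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
scala -version    # should report Scala 2.11.12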
3. Running Spark Applications on the Cluster/Cloud
3.1 Standalone Cluster Manager
Run the application JAR on the cluster:
/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
or
/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
3.2 Hadoop YARN Cluster Manager
Run the application JAR on the cluster (in Spark 2.x, --master yarn-cluster is the deprecated spelling of --master yarn --deploy-mode cluster, but it still works):
/usr/local/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar
When running on YARN, the result has to be viewed in the web UI: the ResourceManager page exposed on port 8088 by the configuration below.
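Alternatively, if YARN log aggregation is enabled, the driver output (including the "Pi is roughly ..." line) can be pulled from the command line using the application ID printed by spark-submit:
/usr/local/hadoop/bin/yarn logs -applicationId <application ID printed by spark-submit>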
Modify the yarn-site.xml configuration:
vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>0.0.0.0:8088</value>
  </property>
</configuration>
Note: the value of yarn.resourcemanager.hostname is master and must not be changed. The yarn.resourcemanager.webapp.address property with value 0.0.0.0:8088 is the newly added entry.
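Since the NodeManagers on the workers read the same yarn-site.xml, the updated file presumably also has to be pushed to slave01 and slave02 before restarting, for example:
scp /usr/local/hadoop/etc/hadoop/yarn-site.xml slave01:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/yarn-site.xml slave02:/usr/local/hadoop/etc/hadoop/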
Restart the ResourceManager and NodeManager:
/usr/local/hadoop/sbin/yarn-daemon.sh stop resourcemanager
/usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager
/usr/local/hadoop/sbin/yarn-daemon.sh stop nodemanager
/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager
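Note that yarn-daemon.sh only acts on the local host; to restart the NodeManagers on every host in the slaves file, the plural yarn-daemons.sh script can be used instead:
/usr/local/hadoop/sbin/yarn-daemons.sh stop nodemanager     # runs on all hosts listed in slaves
/usr/local/hadoop/sbin/yarn-daemons.sh start nodemanager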
Restart Hadoop and Spark, then resubmit the SparkPi JAR:
/usr/local/spark/sbin/stop-all.sh
/usr/local/hadoop/sbin/stop-all.sh
/usr/local/hadoop/sbin/start-all.sh
/usr/local/spark/sbin/start-all.sh
On the SEU cluster, run the application JAR:
cd /opt/spark-2.3.0/
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 examples/jars/spark-examples_2.11-2.3.0.jar
or
/opt/spark-2.3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster /opt/spark-2.3.0/examples/jars/spark-examples_2.11-2.3.0.jar
This fails with an error.