PySpark

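Standalone PySpark programs, unlike Scala ones, do not need a specific nested directory structure, and they do not need to be packaged with sbt.
PySpark is bundled with the Spark distribution, so it does not need to be installed separately.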

Create a git repository on the virtual machine and link it to the remote repository
git status
git init
git add .
git commit -m "Learn PySpark"
git remote add origin https://Xinyuan_John_Chen@bitbucket.org/MoocaChen/ob-ue_admm_decomposition-virtual.git
git push origin master
git push -u origin master
git pull --rebase origin master

RDD
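An RDD can be persisted, that is, kept in memory. A second action then does not trigger the whole computation from scratch; it simply reuses the cached RDD.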
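
The caching behaviour can be sketched as follows. Here sc is assumed to be an existing SparkContext (the pyspark shell provides one under that name); the function name cached_stats and the sample computation are made up for illustration.

```python
def cached_stats(sc, numbers):
    """Return (count, sum of squares), reusing one cached RDD for both actions."""
    squares = sc.parallelize(numbers).map(lambda x: x * x)
    squares.cache()        # mark the RDD for in-memory persistence (lazy)
    n = squares.count()    # first action: computes the RDD and caches it
    total = squares.sum()  # second action: reads the cached partitions
    squares.unpersist()    # free the cached data once it is no longer needed
    return n, total
```

Without the cache() call, both count() and sum() would rerun the map step from the original data.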
One rule of thumb for RDD partitioning is to make the number of partitions as close as possible to the number of CPU cores in the cluster.
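
A minimal sketch of that rule of thumb, assuming local mode where the "cluster" is just the current machine. The helper names suggested_partitions and parallelize_by_cores are hypothetical, and sc is assumed to be an existing SparkContext.

```python
import os


def suggested_partitions(total_cores=None):
    """Rule of thumb: about one partition per CPU core."""
    if total_cores is None:
        # Assumption: local mode, so count the cores of this machine.
        total_cores = os.cpu_count() or 1
    return max(1, total_cores)


def parallelize_by_cores(sc, data, total_cores=None):
    """Create an RDD from data with one partition per core."""
    return sc.parallelize(data, numSlices=suggested_partitions(total_cores))
```

In a running session, parallelize_by_cores(sc, range(100)).getNumPartitions() reports the partition count that was actually used, and sc.defaultParallelism is a useful value to compare it against.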
Key-value pair RDDs.
Passing built-in functions
https://stackoverflow.com/questions/48304008/operation-inside-map-function-in-pyspark
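
As a small illustration of both notes, the sketch below passes a built-in (len), a named function and operator.add to RDD operations, and builds a key-value pair RDD along the way. The names to_pair and word_stats are made up, and sc is assumed to be an existing SparkContext.

```python
from operator import add


def to_pair(word):
    """Named top-level function: turn a word into a (key, value) pair."""
    return (word.lower(), 1)


def word_stats(sc, words):
    """Return the word lengths and the sorted (word, count) pairs."""
    rdd = sc.parallelize(words)
    lengths = rdd.map(len).collect()        # built-in passed directly to map
    counts = (
        rdd.map(to_pair)                    # key-value pair RDD
        .reduceByKey(add)                   # merge the values of equal keys
        .collect()
    )
    return lengths, sorted(counts)
```

Passing len or to_pair directly avoids wrapping them in a lambda; any function passed this way has to be picklable so that Spark can ship it to the workers.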
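Development tool: Visual Studio Code (the school network blocks the download)
Writing Spark applications with Visual Studio Code (Python version)
Configuring a Python 3 development environment in VS Code
 Setting up a Python development environment in VS Code
 Developing Python with VS Code
Getting Started with Python in VS Code: the official documentation, which also introduces extensions.
Python programming with VS Code: covers the basic configuration.
Python editors: Visual Studio Code: worth reading for the useful extensions it lists.
It may be possible to install Anaconda on Ubuntu.
PySpark on Windows.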
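import os
# Set JAVA_HOME before creating a SparkContext, since the JVM is started at that point.
# The raw string keeps the backslashes in the Windows path from being read as escapes.
os.environ['JAVA_HOME'] = r'C:\Program Files\Java\jdk1.8.0_201'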

Xinyuan's blog