How does Python's PATH work? Starting from an "import pyspark" error

Why does import pyspark fail? Let's start by sorting out Python's module search path and its PYTHONPATH environment variable.

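With Spark installed, putting import pyspark at the top of your code will, unsurprisingly, raise an error. Is the module missing? Of course not; Python simply cannot find where it lives. There is a quick and dirty fix: use the findspark package and put this at the top: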

import findspark
findspark.init()

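After that, import pyspark works. Still, having to call a function before importing the other dependencies is rather ugly. So what would an elegant solution look like? Before answering, let's see what findspark.init() actually does: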

# ensure SPARK_HOME is defined
os.environ['SPARK_HOME'] = spark_home

# ensure PYSPARK_PYTHON is defined
os.environ['PYSPARK_PYTHON'] = python_path

# add pyspark to sys.path
spark_python = os.path.join(spark_home, 'python')
py4j = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
sys.path[:0] = [spark_python, py4j]

if edit_rc:
    change_rc(spark_home, spark_python, py4j)

if edit_profile:
    edit_ipython_profile(spark_home, spark_python, py4j)
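
To make the effect concrete, here is a minimal sketch of just the path-related part of the excerpt. It is not findspark's actual implementation: the function name add_spark_to_sys_path is hypothetical, and it assumes the usual Spark layout with a py4j-*.zip archive under python/lib inside the Spark home directory.

```python
import os
import sys
from glob import glob


def add_spark_to_sys_path(spark_home):
    """Prepend Spark's Python sources and its bundled py4j zip to sys.path.

    Hypothetical helper for illustration only; it assumes that
    <spark_home>/python/lib contains the py4j archive shipped with Spark.
    """
    spark_python = os.path.join(spark_home, "python")
    matches = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    if not matches:
        raise FileNotFoundError("no py4j-*.zip found under " + spark_python)
    py4j = matches[0]
    # Slice assignment inserts at the front, so these two entries are
    # searched before anything that was already on sys.path.
    sys.path[:0] = [spark_python, py4j]
    return spark_python, py4j
```

Calling add_spark_to_sys_path('/usr/local/spark') before import pyspark would have roughly the same effect on sys.path as findspark.init(), without touching any environment variables or profile files.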

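Stepping through it in a debugger shows that init() adds paths such as '/usr/local/spark/python' and '/usr/local/spark/python/lib/py4j-0.9-src.zip' to the front of sys.path, and can optionally write them to '~/.bashrc' or an IPython profile as well. After some further reading we can start testing this idea. For example, create a file 'fuck.py' on the Desktop: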

echo "print('FUCK')" > fuck.py
If import fuck succeeds, it prints this text. From the Desktop directory:

frank@mac:Desktop$ python -c "import fuck"
FUCK
frank@mac:Desktop$ python -c "import sys; print(sys.path)"
['', '/Users/frank/Desktop', '/Users/frank/anaconda/lib/python36.zip', '/Users/frank/anaconda/lib/python3.6', '/Users/frank/anaconda/lib/python3.6/lib-dynload', '/Users/frank/anaconda/lib/python3.6/site-packages', '/Users/frank/anaconda/lib/python3.6/site-packages/Sphinx-1.5.1-py3.6.egg', '/Users/frank/anaconda/lib/python3.6/site-packages/aeosa', '/Users/frank/anaconda/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg']

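No problem: when Python is started interactively or with -c, the first entry of sys.path is the empty string '', which stands for the current working directory, so the built-in import statement searches the current directory first.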
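
This behaviour can be checked from a script. The sketch below is only an illustration: the helper can_import and the module name pathdemo_mod_xyz are made up for this example, and it assumes the interpreter is not running in safe-path mode, so PYTHONPATH and PYTHONSAFEPATH are stripped from the child environment to keep the result predictable.

```python
import os
import subprocess
import sys
import tempfile


def can_import(module_name, cwd):
    """Return True if `python -c "import <module_name>"` succeeds when run in cwd."""
    # Drop variables that would change how the child builds sys.path.
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        cwd=cwd,
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


with tempfile.TemporaryDirectory() as module_dir, \
        tempfile.TemporaryDirectory() as other_dir:
    # A throwaway module that only exists inside module_dir.
    with open(os.path.join(module_dir, "pathdemo_mod_xyz.py"), "w") as f:
        f.write("print('imported')\n")
    found_here = can_import("pathdemo_mod_xyz", cwd=module_dir)
    found_elsewhere = can_import("pathdemo_mod_xyz", cwd=other_dir)

print(found_here, found_elsewhere)
```

The only difference between the two runs is the working directory of the child interpreter, which is what the '' entry resolves to.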

What if we switch to another directory?

frank@mac:tmp$ python

frank@mac:tmp$ python -c "import fuck"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'fuck'
frank@mac:tmp$ python -c "import sys; print(sys.path)"
['', '/Users/frank/anaconda/lib/python36.zip', '/Users/frank/anaconda/lib/python3.6', '/Users/frank/anaconda/lib/python3.6/lib-dynload', '/Users/frank/anaconda/lib/python3.6/site-packages', '/Users/frank/anaconda/lib/python3.6/site-packages/Sphinx-1.5.1-py3.6.egg', '/Users/frank/anaconda/lib/python3.6/site-packages/aeosa', '/Users/frank/anaconda/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg']

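Now fuck.py is neither in the current directory nor anywhere else on sys.path. How do we add it? There are two ways. The first is to append its directory to sys.path: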

>>> import sys
>>> sys.path.append('/Users/frank/Desktop')
>>> import fuck
FUCK

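However, this only affects the current process. For a lasting effect, the documentation (29.1. sys — System-specific parameters and functions — Python 3.6.1 documentation) says that sys.path is initialized from the PYTHONPATH environment variable, plus an installation-dependent default. So the second way is to add the directory to PYTHONPATH: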

frank@mac:~$ echo 'export PYTHONPATH=/Users/frank/Desktop:$PYTHONPATH' >> ~/.profile && source ~/.profile

Now the import succeeds:

frank@mac:tmp$ python -c "import fuck"
FUCK
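
The difference between the two approaches can also be seen from a script: a change made with sys.path.append stays inside the process that made it, while PYTHONPATH is read by every interpreter started with that environment. The following is a small sketch under that assumption; the helper child_sys_path is a name invented for this example.

```python
import json
import os
import subprocess
import sys
import tempfile


def child_sys_path(extra_env=None):
    """Return sys.path as seen by a freshly started interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)  # start from a clean slate
    env.update(extra_env or {})
    out = subprocess.run(
        [sys.executable, "-c", "import sys, json; print(json.dumps(sys.path))"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    ).stdout
    return json.loads(out)


with tempfile.TemporaryDirectory() as demo_dir:
    sys.path.append(demo_dir)  # only changes the current process
    seen_after_append = demo_dir in child_sys_path()
    seen_via_env = demo_dir in child_sys_path({"PYTHONPATH": demo_dir})
    sys.path.remove(demo_dir)  # undo the change made above

print(seen_after_append, seen_via_env)
```

Exporting the variable from ~/.profile simply makes sure every new shell, and therefore every Python started from it, gets that environment.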

In summary, to fix the import pyspark failure once and for all, add the following to `~/.profile` (the version number in py4j-0.10.4 may differ on your installation):

export SPARK_HOME=/usr/local/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH

Then run source ~/.profile and you are done.
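
To confirm that the new settings are picked up, one option is a small check that looks for the modules without actually importing them. This is only a sketch: check_pyspark_setup is a hypothetical helper, and it relies on importlib.util.find_spec returning None when a top-level module cannot be located on sys.path.

```python
import importlib.util
import os


def check_pyspark_setup():
    """Report whether SPARK_HOME is set and whether pyspark and py4j can be located."""
    return {
        "SPARK_HOME": os.environ.get("SPARK_HOME"),
        "pyspark_found": importlib.util.find_spec("pyspark") is not None,
        "py4j_found": importlib.util.find_spec("py4j") is not None,
    }


if __name__ == "__main__":
    for key, value in check_pyspark_setup().items():
        print(key, "=", value)
```

If pyspark_found is False in a new terminal, the most likely causes are that ~/.profile was not sourced by that shell or that the py4j version in PYTHONPATH does not match the file under $SPARK_HOME/python/lib.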