If you do not know how to set up Spark on a Mac, please refer to the previous story.
Now that you have Spark installed and built on your Mac, let us make a few changes to get the IDE running.
Set the Spark environment variable in your ~/.bash_profile:
vim ~/.bash_profile
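A quick way to confirm that the variable is visible to Python is a small check like the one below. This is only a sketch: it assumes the variable in question is SPARK_HOME (the name the program later in this post sets), and check_spark_home is a hypothetical helper, not part of Spark.

```python
import os


def check_spark_home(env=os.environ):
    """Return a short status message about SPARK_HOME (assumed variable name)."""
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return "SPARK_HOME is not set"
    if not os.path.isdir(spark_home):
        return "SPARK_HOME points to a missing directory: " + spark_home
    return "SPARK_HOME looks fine: " + spark_home


if __name__ == "__main__":
    print(check_spark_home())
```

If PyCharm was started from the Dock rather than from a terminal, it may not pick up variables from ~/.bash_profile, so running this inside PyCharm can give a different answer than running it in a shell.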
Now open PyCharm.
Create a new project using the Pure Python template.
Now let's create a Python file; name it whatever you like.
Add the Spark Python library (the python directory under your Spark installation) to the project interpreter.
For the word count program you will need a text file.
First create a sample text file. I am going to use part of the text I already wrote in this post as the input.
Finally, the program:
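If you would rather create the file from Python than by hand, something like this would do. The file name sample.txt matches the one the program reads; the text itself and the write_sample helper are just placeholders for illustration.

```python
import os

# Placeholder input: two sentences borrowed from this post.
SAMPLE_TEXT = (
    "Now that you have Spark installed and built on your Mac\n"
    "Let us make few changes to get the IDE running\n"
)


def write_sample(path="sample.txt", text=SAMPLE_TEXT):
    """Write the sample text to path and return the number of lines written."""
    with open(path, "w") as f:
        f.write(text)
    return len(text.splitlines())


if __name__ == "__main__":
    print(write_sample(), "lines written to", os.path.abspath("sample.txt"))
```

Because the program refers to the file by a relative name, the file most likely needs to sit in the working directory of the run configuration, which in PyCharm is normally the folder that contains the script.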
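import os
# Point PySpark at the Spark installation before importing it.
os.environ["SPARK_HOME"] = "/usr/local/spark"
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    # Read the file as an RDD of lines, using at least one partition.
    lines = sc.textFile("sample.txt", 1)
    # Split lines into words, pair each word with 1, then sum the 1s per word.
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    sc.stop()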
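To sanity-check the Spark output on a small file, the same counting can be sketched in plain Python. This is not how Spark executes the job; it only mirrors the steps (split each line on spaces, pair each word with 1, add up the pairs per word) using collections.Counter. The word_count function is a hypothetical helper.

```python
from collections import Counter


def word_count(lines):
    """Count words by splitting each line on single spaces and summing per word."""
    counts = Counter()
    for line in lines:
        # Same split as flatMap(lambda x: x.split(' ')); note that consecutive
        # spaces produce empty-string "words" with this kind of split.
        for word in line.split(' '):
            counts[word] += 1  # equivalent to summing (word, 1) pairs per key
    return dict(counts)


if __name__ == "__main__":
    for word, count in word_count(["to be or not to be"]).items():
        print("%s: %i" % (word, count))
```

Comparing its result with what the Spark run prints is an easy way to spot a wrong input path or an unexpected split.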
The first run usually fails, because a few small things are still missing.
A quick glance at the error shows that a module named py4j.java_gateway is missing.
So we have to add it to the interpreter as well.
Open Preferences again, go to the current interpreter settings, and add the library named py4j-0.9-src.zip (it sits under python/lib in your Spark installation, and the version in the file name depends on your Spark release).
Now let's rerun the code.
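The same two libraries can also be put on the path from code instead of through the Preferences dialog. The sketch below assumes the usual layout of a Spark download, with the Python sources in python/ and the py4j zip in python/lib/ under SPARK_HOME, and uses a glob so the py4j version does not have to be hard-coded. spark_python_paths is a hypothetical helper.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directory and zip files PySpark needs on sys.path (assumed layout)."""
    python_dir = os.path.join(spark_home, "python")
    # Matches e.g. py4j-0.9-src.zip; the version differs between Spark releases.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


if __name__ == "__main__":
    # /usr/local/spark is the location used in this post; adjust if yours differs.
    spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark")
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)
    print(sys.path[:3])
```

Run something like this before importing pyspark; if the returned list contains only the python directory, the py4j zip was not found where this sketch expects it.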
In the screenshot below, you can see each word with its count.
I hope this keeps you busy for the next few days trying out the amazing Apache Spark.
If you've reached this point, you've made it! Have a great day!
Clap away if this helped you out. It encourages me to write more posts. And thanks for the support.