PySpark
1. PySpark Architecture:
Driver Program (SparkContext) --> Cluster Manager --> Worker Nodes
Cluster Manager Types
- Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
- Apache Mesos – a general-purpose cluster manager that can also run Hadoop MapReduce and PySpark applications.
- Hadoop YARN – the resource manager in Hadoop 2.
- Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.
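Which cluster manager is used is selected through the master URL passed when the application is built. Below is a minimal sketch; the host names, ports, and app name are placeholders, and only the local[*] line actually runs here:
# select the cluster manager through the master URL
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         # .master("spark://host:7077")        # Standalone cluster manager
         # .master("mesos://host:5050")        # Apache Mesos
         # .master("yarn")                     # Hadoop YARN
         # .master("k8s://https://host:6443")  # Kubernetes
         .master("local[*]")                   # local mode for this sketch
         .getOrCreate())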
PySpark Modules:
- PySpark RDD (pyspark.RDD)
- PySpark DataFrame and SQL (pyspark.sql)
- PySpark Streaming (pyspark.streaming)
- PySpark MLlib (pyspark.ml, pyspark.mllib)
- PySpark GraphFrames (GraphFrames)
- PySpark Resource (pyspark.resource) – new in PySpark 3.0
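The modules above map to imports roughly as follows – an illustrative sketch only (GraphFrames is a separate package and is shown commented out):
# illustrative imports for the main PySpark modules
from pyspark import RDD                               # pyspark.RDD
from pyspark.sql import SparkSession, DataFrame       # pyspark.sql
from pyspark.streaming import StreamingContext        # pyspark.streaming
from pyspark.ml.feature import VectorAssembler        # pyspark.ml (DataFrame-based MLlib)
from pyspark.mllib.linalg import Vectors              # pyspark.mllib (RDD-based MLlib)
from pyspark.resource import ResourceProfileBuilder   # pyspark.resource (PySpark 3.x)
# from graphframes import GraphFrame                  # GraphFrames, installed separately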
SparkSession combines the APIs that were previously available through separate contexts –
- Spark Context,
- SQL Context,
- Streaming Context,
- Hive Context.
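A minimal sketch of how a single SparkSession exposes that functionality (the session and app name here are only illustrative; the Colab setup below shows how to create one in a notebook):
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("contexts-demo").getOrCreate()
sc = spark.sparkContext              # the SparkContext behind the session
spark.sql("SELECT 1 AS x").show()    # SQL / Hive functionality without a separate SQLContext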
Set up Spark in Colab:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget --continue https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
Set path:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"
Locate Spark in the System:
#locate Spark in the system
import findspark
findspark.init()
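After findspark.init(), the pyspark package that ships with the downloaded Spark build becomes importable; a quick check:
import pyspark
print(pyspark.__version__)   # should report the downloaded 3.2.1 build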
Spark Session:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("Colab") \
    .config('spark.ui.port', '4050') \
    .getOrCreate()
# check spark version
spark.version
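As a quick sanity check that the session works, a tiny DataFrame can be built and displayed (the column names and rows below are arbitrary):
# verify the session by creating and showing a small DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()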
# Get SparkContext Configurations
configurations = spark.sparkContext.getConf().getAll()
for item in configurations:
    print(item)
Output:
('spark.master', 'local')
('spark.app.name', 'Colab')
('spark.driver.port', '40477')
('spark.driver.host', '**********')
('spark.app.id', 'local-***********')
('spark.executor.id', 'driver')
('spark.sql.warehouse.dir', 'file:/content/spark-warehouse')
('spark.ui.port', '4050')
('spark.app.startTime', '1649231261738')
('spark.rdd.compress', 'True')
('spark.serializer.objectStreamReset', '100')
('spark.submit.pyFiles', '')
('spark.submit.deployMode', 'client')
('spark.ui.showConsoleProgress', 'true')
# get specific configuration
print(spark.sparkContext.getConf().get("spark.master"))
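Runtime SQL configurations can also be read and, where mutable, changed through spark.conf; a small sketch:
# read / modify runtime SQL configurations via the RuntimeConfig interface
print(spark.conf.get("spark.sql.shuffle.partitions"))   # defaults to 200
spark.conf.set("spark.sql.shuffle.partitions", "8")     # e.g. lower it for small local jobs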