
PySpark

1. PySpark Architecture:

Driver Program (SparkContext) --> Cluster Manager --> Worker Nodes


Cluster Manager Types
  1. Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
  2. Apache Mesos – a cluster manager that can also run Hadoop MapReduce and PySpark applications.
  3. Hadoop YARN – the resource manager in Hadoop 2.
  4. Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.
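The cluster manager is selected by the `--master` URL passed to `spark-submit` (or to `SparkSession.builder.master`). A sketch of the common forms — the host names, ports, and `app.py` below are placeholders, not values from this post:

```shell
# Standalone – Spark's built-in cluster manager (host/port are placeholders)
spark-submit --master spark://host:7077 app.py

# Apache Mesos
spark-submit --master mesos://host:5050 app.py

# Hadoop YARN – cluster location is read from the Hadoop configuration
spark-submit --master yarn app.py

# Kubernetes – points at the Kubernetes API server
spark-submit --master k8s://https://host:443 app.py

# Local mode – no cluster manager at all; n worker threads in one JVM
spark-submit --master local[4] app.py
```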

Pyspark Modules:

  1. PySpark RDD (pyspark.RDD)
  2. PySpark DataFrame and SQL (pyspark.sql)
  3. PySpark Streaming (pyspark.streaming)
  4. PySpark MLlib (pyspark.ml, pyspark.mllib)
  5. PySpark GraphFrames (GraphFrames)
  6. PySpark Resource (pyspark.resource) – new in PySpark 3.0

A SparkSession is a single entry point that includes the APIs previously spread across different contexts –

  1. Spark Context,
  2. SQL Context,
  3. Streaming Context,
  4. Hive Context.
Set up Spark in Colab:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget --continue https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark

Set path:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"



Locate Spark in the System:
# locate Spark in the system
import findspark
findspark.init()


Create a SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()


# check spark version
spark.version


# Get SparkContext Configurations
configurations = spark.sparkContext.getConf().getAll()
for item in configurations:
    print(item)
Output:
('spark.master', 'local')
('spark.app.name', 'Colab')
('spark.driver.port', '40477')
('spark.driver.host', '**********')
('spark.app.id', 'local-***********')
('spark.executor.id', 'driver')
('spark.sql.warehouse.dir', 'file:/content/spark-warehouse')
('spark.ui.port', '4050')
('spark.app.startTime', '1649231261738')
('spark.rdd.compress', 'True')
('spark.serializer.objectStreamReset', '100')
('spark.submit.pyFiles', '')
('spark.submit.deployMode', 'client')
('spark.ui.showConsoleProgress', 'true')


# get a specific configuration
print(spark.sparkContext.getConf().get("spark.master"))










