
Redis as a Backend for Spark


First we will cover a basic introduction to Apache Spark & Redis, then we will see how we can use Redis in Spark. Here we use Scala for writing the Spark code.

Spark :

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It is a fast, in-memory data processing engine with rich APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads.


Redis :

Redis is an open-source, in-memory data structure store, used as a database, cache and message broker. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and it provides high availability via Redis Sentinel & automatic partitioning with Redis Cluster.


The Spark-Redis library provides access to all Redis data structures like String, List, Set & Hash from Spark RDDs. It also supports reading & writing with both RDDs & DataFrames.

To get the Spark-Redis dependency using Maven, add the below code to the pom.xml file.

<dependencies>
  <dependency>
    <groupId>com.redislabs</groupId>
    <artifactId>spark-redis</artifactId>
    <version>2.4.0</version>
  </dependency>
</dependencies>

For executing a Spark application with Redis, we need to pass the Spark-Redis jar ( https://jar-download.com/?search_box=spark-redis ) & the Redis credentials with the spark-shell command. By default Redis uses localhost:6379 without a password.

spark-shell --jars <path_to_redis_spark_jar>/spark-redis_<version>.jar --conf "spark.redis.host=localhost" --conf "spark.redis.port=6379"

Then you can import the Redis library in the spark-shell :

scala> import com.redislabs.provider.redis._

scala> import org.apache.spark.sql.SparkSession

scala> val spark = SparkSession.builder()
         .appName("SparkRedis")
         .master("local[*]")
         .config("spark.redis.host", "localhost")
         .config("spark.redis.port", "6379")
         .getOrCreate()

Here you can either pass spark.redis.host & spark.redis.port in the spark-shell command or set them on the SparkSession object as above.

Reading Data From Redis :

We have inserted a few keys in Redis. Below are a few ways through which we can fetch those key-value pairs.

  • Using fromRedisKV we can fetch all String values matching a key. The key can also be a pattern.

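Below is a minimal sketch of this call in spark-shell; the stored key names and values are hypothetical, only the alph* pattern comes from the example.

scala> // fromRedisKV returns an RDD[(String, String)] of key-value pairs whose keys match the pattern
scala> val stringTypeRDD = sc.fromRedisKV("alph*")
scala> stringTypeRDD.collect().foreach { case (k, v) => println(s"$k -> $v") }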

In the above example, we fetch all the key-value pairs whose keys start with alph*.

Like the above, there are multiple functions to fetch different types of values. A few of them are below :

  • val hashTypeRDD = sc.fromRedisHash("keyPattern*")  // For fetching Hash type values.
  • val listTypeRDD = sc.fromRedisList("keyPattern*")  // For fetching List values.
  • val setTypeRDD = sc.fromRedisSet("keyPattern*")    // For fetching Set values.

Writing Data To Redis :

For writing data to Redis, use toRedisKV. Here we have created an RDD[(String, String)] and then written it to Redis.

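Below is a minimal sketch of this step; the key and value strings are hypothetical.

scala> // Build an RDD[(String, String)] and write each pair as a plain Redis String key-value
scala> val kvRDD = sc.parallelize(Seq(("emp:1", "John"), ("emp:2", "Jane")))
scala> sc.toRedisKV(kvRDD)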

On the Redis shell you can verify the inserted keys with GET <key> (to start Redis, first start redis-server and then redis-cli from the Redis src folder, or set the Redis path; to install Redis, have a look at https://tecadmin.net/install-redis-ubuntu/).


Like the above, there are multiple functions to write different types of data to Redis. A few of them are below :

  • sc.toRedisHASH(hashRDD, hashName)  // For storing data in a Redis Hash.
  • sc.toRedisLIST(listRDD, listName)  // For storing data in a Redis List.
  • sc.toRedisSET(setRDD, setName)     // For storing data in a Redis Set.

Writing a Spark Dataframe to Redis :

For writing a Spark DataFrame to Redis, we first create a DataFrame in Spark.

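Below is a minimal sketch of such a DataFrame; the case class, column names and rows are hypothetical, only the cmpnyDetail table name used later in the post comes from the text.

scala> // spark-shell auto-imports spark.implicits._, so toDF() is available
scala> case class Company(id: Int, name: String, city: String)
scala> val cmpnyDF = Seq(Company(1, "Acme", "Pune"), Company(2, "Globex", "Mumbai")).toDF()
scala> cmpnyDF.show()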

Then we write the above DataFrame to Redis.

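A sketch of the write call, assuming the hypothetical cmpnyDF above and the cmpnyDetail table name:

scala> cmpnyDF.write
         .format("org.apache.spark.sql.redis")   // spark-redis DataFrame source
         .option("table", "cmpnyDetail")         // rows are stored as Redis hashes under cmpnyDetail:<key>
         .option("key.column", "id")             // use the id column as the Redis key (assumed column name)
         .save()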

We can check this data in Redis; since it is stored as a hash, we fetch it using HGETALL.


Reading a Spark Dataframe from Redis :

Then we read the cmpnyDetail table that we created in Redis earlier back into Spark.

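A sketch of the read call, assuming the same hypothetical id key column:

scala> val cmpnyDetailDF = spark.read
         .format("org.apache.spark.sql.redis")
         .option("table", "cmpnyDetail")
         .option("key.column", "id")
         .load()
scala> cmpnyDetailDF.show()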

While reading & writing we can specify the options below (a combined usage sketch follows the list) :

  • table : Name of the Redis table for reading & writing data.
  • partitions.number : Number of partitions for reading the Dataframe.
  • key.column : Specifies the unique column used as the Redis key. By default the key is auto-generated.
  • ttl : Time to live for the data; the data doesn't expire if ttl is less than 1.
  • infer.schema : Infer the schema from a random row; all columns will have String type.
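A hedged sketch combining several of these options; the cmpnyDF DataFrame and the option values are hypothetical.

scala> cmpnyDF.write
         .format("org.apache.spark.sql.redis")
         .option("table", "cmpnyDetail")
         .option("key.column", "id")      // column used as the Redis key
         .option("ttl", "300")            // keys expire after 300 seconds
         .mode("overwrite")
         .save()

scala> val inferredDF = spark.read
         .format("org.apache.spark.sql.redis")
         .option("table", "cmpnyDetail")
         .option("infer.schema", true)    // infer the schema from a random row; columns come back as String
         .load()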

Structured Streaming :

Spark-Redis supports the Redis Stream data structure as a source for Structured Streaming.

The following example reads data from a Redis Stream empStream that has two fields empId and empName:

import org.apache.spark.sql.types._

val employee = spark.readStream
  .format("org.apache.spark.sql.redis")
  .option("stream.keys", "empStream")
  .schema(StructType(Array(
    StructField("empId", StringType),
    StructField("empName", StringType)
  )))
  .load()

val query = employee.writeStream
.format("console")
.start()

query.awaitTermination()



Introduction to Apache Spark



Before getting to know Apache Spark, we need to understand what Big Data is and the challenges of working with it.

What is Big Data?

Big Data is simply a term for the large volumes of data that are growing exponentially with time. Such data is so huge and complex that none of the traditional data management tools can store or process it efficiently.

Challenges while working with Big Data

  • Data Storage

    The volume of data is huge and increases exponentially on a daily basis. To handle it we need a storage system that can grow in capacity and compress the data across multiple machines, which are connected to each other and can share data efficiently.

  • Getting data from various systems

    It is a tough task because of the large volume and high velocity of data. There are millions of sources producing data at high speed. To handle this we need devices that can capture the data effectively and efficiently, for example sensors that sense data in real time and send this information to the cloud for storage.

  • Analyzing and Querying data

    This is the most difficult task: we must not only retrieve old data but also deal with insights in real time. To handle this there are several options available. One option is to increase processing speed using distributed computing. In distributed computing, we first build a network of machines or nodes known as a “cluster”. Once a task arrives at the cluster, it gets broken down into sub-tasks which are distributed to different nodes. Finally, the output of each node is aggregated into the final output.

Apache Spark

Apache Spark was created at the University of California, Berkeley’s AMPLab in 2009. Later on, it was donated to the Apache Software Foundation. Apache Spark is mostly written in Scala, with some code in Java. Apache Spark provides APIs for programmers in Java, Scala, R, and Python. In simple words, Apache Spark is a cluster-computing framework which supports in-memory computation for processing, querying and analyzing Big Data.

In-memory computation:

Apache Spark's core concept is that it saves and loads data in and from RAM rather than from disk (hard drive). RAM has a much higher access speed than a hard drive, which is why, thanks to in-memory computation, Spark can be up to 100 times faster than the Hadoop MapReduce framework.
Note – Apache Spark is not a replacement for Hadoop. It is actually designed to run on top of Hadoop.
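As a small illustrative sketch (run in spark-shell, where sc is available; the file path is hypothetical), caching keeps a dataset in RAM so that repeated actions avoid re-reading it from disk:

val logs = sc.textFile("/data/app.log")        // hypothetical input file
val errors = logs.filter(_.contains("ERROR"))
errors.cache()                                 // keep the filtered RDD in memory
println(errors.count())                        // first action reads the file and caches the result
println(errors.count())                        // second action is served from RAM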

Apache Spark Cluster-Overview:

[Figure: Apache Spark cluster overview]

Image source: https://spark.apache.org/docs/1.1.1/img/cluster-overview.png

Spark Context : Every Spark application runs as an independent set of processes, coordinated by the SparkContext in the driver program. The SparkContext holds the connection to the Spark cluster manager.

Cluster Manager: Its job is to allocate resources to each application submitted by the driver program. Below are the types of cluster managers supported by Apache Spark.

  • Standalone
  • Mesos
  • YARN

Worker nodes: After the SparkContext connects to the cluster manager, it starts assigning work to the worker nodes in the cluster, which work independently on their tasks and interact with each other.
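A minimal sketch of how these pieces fit together; the application name and master URL are illustrative. The SparkContext created in the driver connects to the cluster manager through the master URL, and the tasks for each partition run on the worker nodes.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ClusterOverviewDemo")   // hypothetical application name
  .setMaster("local[*]")               // or spark://<host>:7077 (Standalone), yarn, mesos://<host>:5050
val sc = new SparkContext(conf)

val numbers = sc.parallelize(1 to 100, 4)   // 4 partitions -> 4 tasks for the worker nodes
println(numbers.map(_ * 2).reduce(_ + _))   // per-node results are aggregated back in the driver

sc.stop()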

Installation of Apache Spark on Ubuntu:

1) Install Java

First check whether Java is installed using "java -version".


If it is not installed, then run the below commands:

sudo apt-get update
sudo apt-get install default-jdk

2) Install Scala

sudo apt-get install scala


To check, type scala into your terminal:
scala


You should see the Scala REPL running. Test it with:
println("Hello World")

You can then quit the Scala REPL with
:q

3) Install Spark

We need git for this, so in your terminal type:
sudo apt-get install git


Download the latest Apache Spark release and untar it (create the target directory first):
sudo mkdir -p /usr/local/spark
sudo tar xvf spark-2.3.1-bin-hadoop2.7.tgz -C /usr/local/spark --strip-components=1


Add the Spark path to the bash file: nano ~/.bashrc

Add the below code snippet to the bash file:
SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH

Execute the below command after editing the .bashrc:
source ~/.bashrc


Go to the bin directory and execute the Spark shell: ./spark-shell


In the next article we will take a deep dive into Apache Spark with coding examples using Scala.