We understand that giving an interview can be nerve-racking, especially when it is a big data job interview. Every candidate feels the need to prepare before going for a big data or Spark developer job interview.
It is difficult to predict the questions you will be asked in the interview. Therefore, to help you out, we have put together a list of top Apache Spark interview questions and answers that you can prepare before your Spark developer or big data processing job interview.
What is Apache Spark?
Before going ahead, let us first understand what Apache Spark is. Apache Spark is a flexible, easy-to-use data processing framework that allows big data professionals to execute streaming workloads efficiently. It is a fast, general-purpose data processing engine, developed at UC Berkeley in 2009 for fast computation. With Apache Spark, you can distribute data across a cluster and process it in parallel, and you can easily write applications in Java, Python, or Scala. Spark was developed to overcome the limitations of the MapReduce cluster computing paradigm: Spark is able to keep data in memory, whereas MapReduce shuffles data in and out of disk. Furthermore, Spark supports SQL queries, streaming data, and graph data processing. Most importantly, Apache Spark does not require Hadoop; it can run on its own, reading and writing data stored in systems such as Cassandra or S3. Apache Spark can run up to 100 times faster than Hadoop MapReduce.
Top Apache Spark Interview Questions and Answers
We are listing the top Apache Spark interview questions and answers that you can prepare before going for your big data job interview.
1. What are the primary features of apache spark?
The key features of apache spark are as follows:
- Lazy evaluation- Apache Spark uses lazy evaluation, delaying computation until a result is actually required.
- Support for programming languages- You can write Spark code in four programming languages: Java, Python, R, and Scala, and the platform provides high-level APIs in all of them. Spark also provides shells for Python and Scala, which you can launch with the ./bin/pyspark and ./bin/spark-shell scripts, respectively.
- Machine learning- The machine learning feature of Apache spark is useful for big data processing as it removes the need to use separate engines for machine learning and processing.
- Multiple format support- Spark supports multiple data sources such as JSON, Hive, and Parquet. Moreover, the Data Sources API offers a pluggable mechanism for accessing structured data through Spark SQL.
- Speed- Apache Spark can run up to 100 times faster than Hadoop MapReduce. It achieves this speed through controlled partitioning: Spark manages data by means of partitions, which helps parallelize distributed data processing with minimal network traffic.
- Hadoop integration- Apache Spark provides efficient connectivity with Hadoop: it can run on top of existing Hadoop clusters via YARN and read data from HDFS.
- Real-time processing- Thanks to Apache Spark's in-memory computation, real-time computation and processing are possible with low latency.
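The lazy-evaluation feature above can be illustrated with a plain-Python sketch. This is only a mimic of the idea, not actual Spark code: transformations merely record a plan, and nothing runs until a concrete result is demanded.

```python
# Plain-Python mimic of Spark's lazy evaluation (illustrative only,
# not the PySpark API): generators defer work until consumed.

def lazy_map(func, data):
    # Nothing is computed here; the generator just records the plan.
    return (func(x) for x in data)

def lazy_filter(pred, data):
    return (x for x in data if pred(x))

numbers = range(1, 11)
plan = lazy_filter(lambda x: x % 2 == 0, lazy_map(lambda x: x * x, numbers))

# Only now, when we ask for a concrete result (the "action"),
# does the chained computation actually execute.
result = list(plan)  # [4, 16, 36, 64, 100]
```

In real Spark, the same principle lets the engine optimize the whole chain of transformations before any data is touched.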
2. What are the advantages of apache spark over Hadoop MapReduce?
This is one of the Apache Spark interview questions that is often asked. The following are the advantages of Apache Spark over Hadoop MapReduce.
- Multitasking- Hadoop only supports batch processing through its built-in libraries. Apache Spark, on the other hand, comes with built-in libraries for performing multiple tasks: batch processing, interactive SQL queries, machine learning, and streaming.
- Enhanced speed- Spark's in-memory processing can be up to 100 times faster than Hadoop MapReduce.
- No disk dependence- Hadoop MapReduce is heavily disk-bound, whereas Apache Spark uses in-memory data storage and caching.
3. What is the function of a spark Engine?
One can use a spark engine to distribute, schedule, and monitor the data application across the cluster.
4. What do you mean by partitions?
A partition is a smaller, logical division of data, similar to a "split" in MapReduce. Partitioning is the process of deriving logical units of data in order to speed up processing. Everything in Spark is a partitioned RDD.
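The idea of partitioning can be sketched in plain Python (an illustrative analogy, not Spark code): a dataset is cut into logical chunks, each of which could be processed by a different worker, and the small per-chunk results are merged at the end.

```python
# Illustrative sketch (not Spark code): splitting a dataset into
# logical partitions so each chunk can be processed independently.

def partition(data, num_partitions):
    """Divide data into roughly equal logical chunks."""
    size = len(data)
    return [data[i * size // num_partitions:(i + 1) * size // num_partitions]
            for i in range(num_partitions)]

data = list(range(10))
parts = partition(data, 3)
# Each partition could now be handed to a different worker in parallel;
# only the small per-partition results need to be combined.
totals = [sum(p) for p in parts]
grand_total = sum(totals)  # same answer as sum(data): 45
```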
5. What is the concept of resilient distributed Datasets? Also, state the method for creating a new RDD in apache spark.
An RDD (resilient distributed dataset) is a fault-tolerant collection of operational elements that can run in parallel. The partitioned data in an RDD is immutable and distributed across nodes.
We can say that RDDs are small portions of data that may be stored in the memory, which is distributed over numerous nodes. Moreover, spark uses lazy evaluation, and thereby RDDs are lazily evaluated, which helps spark achieve tremendous speed. There are two types of RDDs.
- Hadoop datasets- These types of RDDs involve performing functions on every file record stored in a Hadoop distributed file system (HDFS) or other storage systems.
- Parallelized collections- These are RDDs created from an existing in-memory collection, whose elements are processed in parallel.
Now, if we talk about creating a new RDD in apache-spark, then there are two ways.
- You can create an RDD by parallelizing a collection in the driver program, using the SparkContext's parallelize() method.
- You can load an external dataset from external storage such as HBase, HDFS, or a shared file system.
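The two creation paths above can be illustrated with a plain-Python analogy. This is a hedged sketch of the idea, not the PySpark API; in real Spark code the equivalents would be sc.parallelize(collection) and sc.textFile(path).

```python
import os
import tempfile

# 1) "Parallelizing" an in-memory collection: chunk it into partitions,
#    analogous to sc.parallelize(collection, 2) in real PySpark.
collection = [1, 2, 3, 4, 5, 6]
parallelized = [collection[:3], collection[3:]]  # two partitions

# 2) Loading an external dataset, analogous to sc.textFile(path):
#    each line of the file becomes one record.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("alpha\nbeta\ngamma\n")
    path = f.name

with open(path) as f:
    records = [line.strip() for line in f]

os.remove(path)
```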
6. What are the operations that are supported by RDD?
The functions supported by RDD are transformations and actions.
7. What are transformations in spark?
Transformations in Spark are functions applied to RDDs that result in a new RDD. However, they are not executed until an action occurs. Examples of transformations are map() and filter(): map() applies a function to each element of the RDD and produces a new RDD of the results, while filter() creates a new RDD by selecting the elements of the current RDD that satisfy a predicate.
8. What do you mean by actions in spark?
Actions in Spark bring data back from an RDD to the local machine; they are RDD operations that produce non-RDD values. Examples of actions are reduce(), which repeatedly combines elements until a single value remains, and take(n), which fetches the first n elements of an RDD to the local machine.
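The transformation/action split can be mimicked in a few lines of plain Python. This is a toy sketch, not the real RDD implementation: map() and filter() only record a plan, while collect(), count(), and take() force execution.

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions are eager."""

    def __init__(self, compute):
        self._compute = compute  # a zero-arg function producing the data

    # --- transformations: return a new MiniRDD, nothing runs yet ---
    def map(self, func):
        return MiniRDD(lambda: [func(x) for x in self._compute()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self._compute() if pred(x)])

    # --- actions: force the whole pipeline to execute ---
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

    def take(self, n):
        return self._compute()[:n]

rdd = MiniRDD(lambda: [1, 2, 3, 4, 5])
pipeline = rdd.map(lambda x: x * 10).filter(lambda x: x > 20)
first_two = pipeline.take(2)   # [30, 40]
total = pipeline.count()       # 3
```

Note how building `pipeline` runs nothing: only the actions at the end trigger the chained computation, just as in Spark.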
9. What are the functions of the spark core?
Some of the functions of spark core are as follows:
- Monitoring jobs
- Provides fault-tolerance
- Job scheduling
- Interaction with storage systems
- Memory management
10. What do you mean by RDD lineage?
Spark does not replicate data in memory, so Spark RDD lineage is used to rebuild lost data: the lineage records the sequence of transformations that produced an RDD, which allows lost data partitions to be reconstructed by recomputation.
11. What do you mean by spark driver?
The program that runs on the master node and declares the transformations and actions on data RDDs is known as the Spark driver program. In other words, the Spark driver creates the SparkContext and delivers the RDD graphs to the master, where the standalone cluster manager runs.
12. Define the term spark streaming.
One of the most asked Apache Spark interview questions is to define Spark Streaming. Spark Streaming is an extension of the Spark API that enables processing of live data streams. Data is ingested from sources such as Flume, Kinesis, and Kafka, and the processed results can be pushed to file systems, live dashboards, and databases. Internally, the input data is processed in small batches, so the processing resembles batch processing.
13. What are the functions of MLlib in Apache Spark?
MLlib is the machine learning library provided by Spark. MLlib aims to make machine learning easy and scalable by providing common learning algorithms and utilities such as clustering, classification, regression, collaborative filtering, and dimensionality reduction.
14. What do you mean by Spark SQL?
Spark SQL (which evolved from the earlier Shark project) is a module for structured data processing, through which Spark can run SQL queries on data. Moreover, Spark SQL supports a special RDD called SchemaRDD, composed of row objects plus schema information that defines the type of data in each column of every row.
15. What are the functions of Spark SQL?
The functions of spark SQL are as follows:
- Spark SQL can load the data from several structured sources.
- Spark SQL can query data using SQL statements, both from within Spark programs and from external tools that connect to Spark SQL through standard database connectors, for example, business intelligence tools such as Tableau.
- It provides integration between the regular python/Java/Scala code and SQL.
16. What do you mean by YARN in Apache Spark?
Another common Apache Spark interview question is to define YARN. YARN (Yet Another Resource Negotiator) is Hadoop's resource management platform, and Spark can run on it to get scalable, managed operation across the cluster. Moreover, if you want to run Apache Spark on YARN, you need a binary distribution of Spark that is built with YARN support.
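Submitting a Spark application to YARN typically looks like the following spark-submit invocation (an illustrative fragment; the class name and JAR path are placeholders):

```shell
# Submit a Spark application to a YARN cluster (names are illustrative).
# Requires a Spark build with YARN support and HADOOP_CONF_DIR set.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```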
17. What do you mean by Spark Executor?
When the SparkContext connects to the cluster manager, it acquires executors on the nodes of the cluster. Spark executors run computations and store data on the worker nodes. Finally, the SparkContext sends tasks to the executors for execution.
18. Mention the different types of cluster managers in spark?
There are three types of cluster managers that are supported by the Spark framework.
- Standalone- a basic cluster manager bundled with Spark that makes it easy to set up a cluster.
- Apache Mesos- a general-purpose cluster manager that can run both Hadoop MapReduce and Spark applications.
- YARN- the cluster manager responsible for resource management in Hadoop.
19. What do you mean by the Parquet file?
A Parquet file is a columnar-format file supported by many other data processing systems. Spark SQL can perform both read and write operations on Parquet files, and the columnar layout makes Parquet one of the best formats for data analytics so far.
20. Is it necessary to install spark on all the nodes of the YARN cluster while you run apache spark on YARN?
It is not necessary to install Spark on all the nodes of the YARN cluster, because Apache Spark runs on top of YARN, which distributes the work across the cluster.
21. State the components of the spark ecosystem?
The following are the components of the spark ecosystem.
- MLlib- It is Spark's machine learning library.
- GraphX- It is for implementing graphs and graph-parallel computation.
- Spark core- it is the base engine, which is used for parallel and distributed data processing on a large scale.
- Spark streaming- Spark streaming helps in the real-time processing of streaming data.
- Spark SQL- it integrates Spark's functional programming API with relational processing.
22. Can you use apache spark for analyzing and accessing the data stored on the Cassandra database?
Yes, analyzing and accessing data stored in a Cassandra database is possible by using the Spark Cassandra Connector, which you add to the Spark project. When you connect Cassandra with Apache Spark, queries run much faster because less network traffic is needed to send data between the Cassandra nodes and the Spark executors.
23. Define the worker node?
A worker node is any node that can run application code in the cluster. The driver program must listen for and accept incoming connections from its executors, and it must therefore be network addressable from the worker nodes.
24. What is the procedure to connect apache spark with apache mesos?
The procedure to connect Apache Spark with Apache Mesos is as follows:
- Configure the Spark driver program to connect to Apache Mesos.
- Place the Spark binary package at a location accessible by Apache Mesos.
- Install Apache Spark in the same location as Apache Mesos.
- Configure the spark.mesos.executor.home property to point to the location where Apache Spark is installed.
25. What are the ways for minimizing data transfers while you are working with spark?
Writing Spark programs that run fast and reliably requires minimizing data transfers. These are the ways to minimize data transfers while working with Apache Spark:
- Use accumulators- Accumulators provide a way to update variable values in parallel during execution, so only the small updates, not the data, travel across the network.
- Avoid shuffles- You can minimize data transfers by avoiding repartition, the *ByKey operations (such as groupByKey and reduceByKey), and other operations that trigger shuffles.
- Use broadcast variables- You can enhance the efficiency of joins between small and large RDDs by using broadcast variables.
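The accumulator idea above can be sketched in plain Python (an illustrative analogy, not the real SparkContext.accumulator API): each partition contributes a tiny local update, and only those small results are merged centrally, instead of moving the data itself.

```python
# Toy accumulator: each "task" computes a small local count for its
# partition, and only the per-partition results travel to the "driver".

partitions = [[1, -2, 3], [4, 5, -6], [-7, 8, 9]]

def count_negatives(partition):
    # Runs "on a worker": returns a tiny local result, not the data.
    return sum(1 for x in partition if x < 0)

local_counts = [count_negatives(p) for p in partitions]
negative_total = sum(local_counts)  # merged on the "driver": 3
```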
26. Explain broadcast variables in apache-spark and what are their uses?
One of the most asked Apache Spark interview questions concerns broadcast variables. Broadcast variables are useful because, instead of shipping a copy of a variable with every task, a broadcast variable keeps a read-only, cached version of the variable on each machine.
With broadcast variables, every node gets one copy of a large input dataset. To reduce communication costs, Apache Spark distributes broadcast variables using efficient broadcast algorithms.
Broadcast variables thereby reduce the need to ship copies of a variable with each task. They are also useful for storing a lookup table in memory, which enhances retrieval efficiency compared with an RDD lookup().
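The lookup-table use case can be sketched in plain Python (illustrative only, not the real sc.broadcast API): a small table is shared read-only with every "task" rather than being copied into each task's closure.

```python
# Toy broadcast-style lookup join: a small read-only table is shared
# with every "task" instead of shipping a copy inside each closure.

country_names = {"IN": "India", "US": "United States", "FR": "France"}

def enrich(partition, lookup):
    # Every task reads the same cached lookup table.
    return [(code, lookup.get(code, "unknown")) for code in partition]

partitions = [["IN", "US"], ["FR", "XX"]]
enriched = [enrich(p, country_names) for p in partitions]
# [[('IN', 'India'), ('US', 'United States')],
#  [('FR', 'France'), ('XX', 'unknown')]]
```

In real Spark this pattern replaces a shuffle-heavy join between a large RDD and a small one.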
27. Are checkpoints provided by Apache Spark?
Yes, checkpoints are provided by Apache Spark. Checkpoints allow a program to run around the clock and make it resilient to failures; lineage graphs are used to recover RDDs after a failure.
Moreover, Apache Spark is equipped with an API for adding and managing checkpoints, so the user can decide which data to checkpoint. Checkpointing is preferred over pure lineage-based recovery when lineage graphs grow long or have wide dependencies.
28. Mention the levels of persistence in Apache Spark?
There are different levels of persistence in Apache Spark for storing RDDs on disk, in memory, or as a combination of both, with different levels of replication. The following are the persistence levels in Spark:
- Memory and disk- Stores the RDD in the JVM as deserialized Java objects. If the RDD does not fit in memory, the remaining partitions are stored on disk.
- Disk only- As the name suggests, this level stores the RDD partitions on disk only.
- Memory only ser- Stores the RDD as serialized Java objects, one byte array per partition.
- Memory and disk ser- Similar to memory only ser, except that partitions that do not fit in memory are spilled to disk.
- Memory only- Stores the RDD in the JVM as deserialized Java objects. If the RDD does not fit in memory, some partitions are not cached and are recomputed on the fly when needed.
- Off heap- Similar to memory only ser, but the data is stored in off-heap memory.
29. What are the limitations of using apache spark?
Some of the limitations of using apache spark are as follows:
- Apache Spark does not have a built-in file management system, so you need to integrate it with another platform such as Hadoop for file management.
- There is no support for true record-at-a-time streaming. In Apache Spark, a live data stream is partitioned into batches, and the output is likewise produced in batches. Spark Streaming is therefore micro-batch processing rather than true real-time processing.
- The number of algorithms available in Spark's libraries is comparatively small.
- Spark Streaming does not support record-based window criteria, only time-based windows.
- You cannot run everything on a single node; the work has to be distributed over several nodes in a cluster.
- Spark's reliance on in-memory processing can make cost-efficient big data processing challenging, since holding data in memory requires a lot of RAM.
30. State the way to trigger automated clean-ups in apache spark other than ‘spark.cleaner.ttl’?
Another way to trigger automated clean-ups in Spark is to split long-running jobs into different batches and write the intermediate results to disk.
31. Mention the role of Akka in spark?
Akka is used in Spark for scheduling. Through this message-passing process, the workers and the master can send and receive messages about tasks.
32. Explain schemaRDD in apache spark RDD?
An RDD that carries row objects (wrappers around basic string or integer arrays) together with schema information about the type of data in each column is known as a SchemaRDD. It has since been renamed the DataFrame API.
33. What is the reason for designing schemaRDD?
SchemaRDD was designed to help developers with code debugging and unit testing on the SparkSQL core module.
34. What is the procedure for removing the elements when the key is present in any other RDD?
You can easily remove elements whose keys are present in another RDD by using the subtractByKey() function.
35. State the difference between persist() and cache()
With persist(), users can specify the storage level, whereas cache() uses the default storage level (MEMORY_ONLY).
36. What do you mean by Executor memory in a spark application?
Every Spark application has a fixed heap size and a fixed number of cores for each Spark executor. The heap size is referred to as the Spark executor memory, controlled by the spark.executor.memory property or the --executor-memory flag.
Each worker node runs one executor for the Spark application, and the executor memory is a measure of how much of the worker node's memory the application utilizes.
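At submit time, the executor memory can be set as follows (an illustrative fragment; the memory and core values and JAR name are placeholders):

```shell
# Setting the executor heap size when submitting an application
# (equivalent to setting the spark.executor.memory property):
spark-submit \
  --executor-memory 4g \
  --executor-cores 2 \
  my-app.jar
```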
37. What are the ways for identifying the given operation to be a transformation or action in a spark program?
The users can easily identify the operation to be a transformation or action based on the return type.
- An operation is a transformation when its return type is an RDD.
- An operation is an action when its return type is anything other than an RDD.
38. What do you think are the common mistakes that the spark developers make?
Some of the common mistakes that the spark developers make are as follows:
- Spark developers might make mistakes while managing directed acyclic graphs (DAGs).
- The spark developers may also make some mistakes while maintaining the required size for shuffle blocks.
39. Mention some companies that are using spark streaming?
Some of the companies that use Spark Streaming are Uber, Netflix, and Pinterest.
40. Can we use apache spark for reinforcement learning?
Apache Spark is not preferred for reinforcement learning; it is best suited to simpler machine learning algorithms such as clustering, regression, and classification.
41. How does spark handle the monitoring and logging in standalone mode?
Apache Spark uses a web-based user interface for monitoring the cluster in standalone mode, displaying cluster and job statistics. In addition, the log output for every job is written to the working directory of the worker (slave) nodes.
42. State the common workflow of a spark program.
The common workflow of a spark program is as follows:
- The first step in a Spark program is to create input RDDs from external data.
- Next, create new transformed RDDs based on the business logic, using RDD transformations such as filter().
- Call persist() on any intermediate RDDs that may have to be reused later.
- Finally, launch the parallel computation with RDD actions such as first() and count(); Spark then optimizes and executes the resulting plan.
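The workflow above can be sketched with a plain-Python analogy (not real PySpark; generators play the role of lazy RDDs and list() plays the role of persist()):

```python
# 1) "Input RDD" from external data (here, an in-memory stand-in).
lines = ["error: disk full", "ok", "error: timeout", "ok"]

# 2) Transformation: keep only the error lines (lazy, like filter()).
errors = (line for line in lines if line.startswith("error"))

# 3) "Persist" the intermediate result so it can be reused:
#    materializing the generator plays the role of persist()/cache().
cached_errors = list(errors)

# 4) Actions: first() and count() equivalents on the cached data.
first_error = cached_errors[0]    # 'error: disk full'
error_count = len(cached_errors)  # 2
```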
43. What are the differences between spark SQL and Hive?
The following are the differences between spark SQL and Hive.
- Spark SQL is generally faster than Hive.
- You can execute a Hive query in Spark SQL, but you cannot execute a Spark SQL query in Hive.
- Hive is a framework, whereas Spark SQL is a library.
- It is not necessary to create a metastore in Spark SQL, whereas a metastore is compulsory in Hive.
- Spark SQL can infer the schema automatically, but in Hive the schema needs to be explicitly declared.
44. What do you mean by receivers in spark streaming?
Receivers are special entities in Spark Streaming that consume data from various data sources and move it into Apache Spark. The streaming context creates receivers as long-running tasks, scheduled in a round-robin manner, with each receiver occupying a single core.
45. What do you mean by a sliding window in spark? Explain with an example.
A sliding window in Spark specifies which batches of a stream are processed together: you set a window length and a sliding interval, and each time the window slides over the stream, the batches falling within it are combined and processed. For example, with a window length of 30 seconds and a sliding interval of 10 seconds, Spark processes the last 30 seconds of data every 10 seconds.
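The window mechanics can be sketched in plain Python (illustrative only; real Spark Streaming uses DStream.window(windowLength, slideInterval)):

```python
# Toy sliding window over a stream of per-batch values
# (illustrative; not the Spark Streaming API).

def sliding_windows(batches, window_length, slide_interval):
    """Sum each window of `window_length` batches, advancing by
    `slide_interval` batches per step."""
    results = []
    for start in range(0, len(batches) - window_length + 1, slide_interval):
        window = batches[start:start + window_length]
        results.append(sum(window))
    return results

stream = [1, 2, 3, 4, 5, 6]
# Window of 3 batches, sliding by 2 batches each step:
totals = sliding_windows(stream, 3, 2)  # [6, 12]
```

Note how consecutive windows overlap (batch 3 contributes to both), exactly as overlapping time windows do in Spark Streaming.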
We hope you found the above Apache Spark interview questions and answers useful. Go through all of them to get an idea of the type of questions asked in a big data job interview, and you will be well prepared to crack your Spark developer or big data interview.