This is the second article in a four-part series about Apache Spark on YARN. Because Apache Spark is an in-memory distributed data processing engine, application performance depends heavily on the resources allocated to it, such as executors, cores, and memory. The resources an application needs depend on its characteristics, such as how storage- or computation-intensive it is.
A few performance bottlenecks were identified in the SFO Fire Department call service dataset use case with the YARN cluster manager. One of these bottlenecks was improper usage of resources in the YARN cluster: the application was executed with Spark's default configuration rather than resource settings tuned to the workload.
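As an illustration of the kind of explicit resource allocation discussed here, a `spark-submit` invocation can override the defaults with the standard executor flags. This is a minimal sketch, not the exact configuration used in the use case; the application JAR name, class, and the specific values are hypothetical placeholders.

```shell
# Hypothetical example: submitting a Spark application to YARN with
# explicit resource settings instead of relying on Spark defaults.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 6 \        # total executors requested from YARN
  --executor-cores 4 \       # cores per executor
  --executor-memory 8g \     # heap memory per executor
  --driver-memory 4g \       # memory for the driver process
  --class com.example.FireCallsApp \
  fire-calls-app.jar
```

The same settings can also be supplied as `--conf spark.executor.instances=6` style properties or placed in `spark-defaults.conf`; command-line flags take precedence over the defaults file.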