Spark Support for Hive

Alation supports Hive on Spark.

Apache Spark is an open-source parallel processing framework for running large-scale analytical applications across clustered computers. Spark can process data from a variety of repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases, and relational data stores such as Apache Hive.
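As a brief illustration of Spark reading Hive-managed data, the sketch below starts a PySpark session with Hive support and queries a table through Spark SQL. The application name and table name are placeholders, not values from this document.

  from pyspark.sql import SparkSession

  # Start a Spark session with Hive support so Spark can resolve tables
  # registered in the Hive metastore. App and table names are placeholders.
  spark = (
      SparkSession.builder
      .appName("hive-read-example")
      .enableHiveSupport()
      .getOrCreate()
  )

  # Run a Spark SQL query against a Hive table.
  spark.sql("SELECT * FROM default.sample_table LIMIT 10").show()

  # The same session can also read other sources, for example HDFS files:
  # spark.read.parquet("hdfs:///path/to/data")

  spark.stop()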

Spark can be configured through the Cloudera Distribution of Apache Hadoop (CDH) or the Hortonworks Data Platform (HDP).

Configuring Hive on Spark on CDH

To configure Hive on Spark, you need configuration (administrator) privileges on the cluster. Follow the recommendations below to configure Hive to run on Spark on CDH.

  • Configure the Hive client to use the Spark execution engine (a minimal client-side sketch follows this list). For more information, see Managing Hive in the CDH documentation.

  • Hive must be able to identify the Spark service to use. Cloudera Manager sets this automatically, based on the MapReduce or YARN service and the Spark service configured for Hive. For more information, see Running Hive on Spark in the CDH documentation.
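
As a hedged, client-side sketch of the first recommendation above: the Hive property hive.execution.engine can be set to spark for a session, as shown below using the PyHive client. The host, port, username, and table name are placeholder assumptions; cluster-wide, Cloudera Manager manages the same property in hive-site.xml.

  from pyhive import hive

  # Connect to HiveServer2; host, port, and username are placeholders.
  conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hive_user")
  cursor = conn.cursor()

  # Ask Hive to run this session's queries on the Spark execution engine.
  cursor.execute("SET hive.execution.engine=spark")

  # Subsequent queries in this session are executed as Spark jobs.
  cursor.execute("SELECT COUNT(*) FROM sample_table")
  print(cursor.fetchone())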

Configuring Hive on Spark on HDP

Spark can be configured with HDP for a kerberized cluster or a non-kerberized cluster.
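
The cluster-side steps are covered by the documentation linked below. Purely as a client-side illustration of the difference, connecting to HiveServer2 on a kerberized cluster typically requires a valid Kerberos ticket and Kerberos authentication on the connection, as in the hedged PyHive sketch below; the host and service name are placeholders.

  from pyhive import hive

  # Kerberized cluster: obtain a ticket first (e.g. kinit user@REALM or a keytab),
  # then connect with Kerberos authentication. On a non-kerberized cluster a
  # plain username-based connection is sufficient. Host and service name are placeholders.
  conn = hive.Connection(
      host="hiveserver2.example.com",
      port=10000,
      auth="KERBEROS",
      kerberos_service_name="hive",
  )
  cursor = conn.cursor()
  cursor.execute("SHOW DATABASES")
  print(cursor.fetchall())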

To configure Spark on HDP for a kerberized cluster, use the following links:

To configure Spark on HDP for a non-kerberized cluster, use the following links: