Apache Spark integration

Apache Spark is an open-source, general-purpose cluster-computing framework. See the Apache Spark website for more information. Vertica provides a Spark connector that you install into Spark to transfer data between Vertica and Spark.

Using the connector, you can:

  • Copy large volumes of data from Spark DataFrames to Vertica tables. This feature lets you save your Spark analytics in Vertica.

  • Copy data from Vertica to Spark RDDs or DataFrames for use with Python, R, Scala and Java. The connector efficiently pushes down column selection and predicate filtering to Vertica before loading the data.

The connector lets you use Spark to preprocess data for Vertica and to use Vertica data in your Spark application. You can even round-trip data from Vertica to Spark—copy data from Vertica to Spark for analysis, and then save the results of that analysis back to Vertica.
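
For example, a read from Vertica into a Spark DataFrame looks roughly like the following Scala sketch. The data source name and the option names (host, port, db, user, password, table, staging_fs_url) follow the connector project's documented configuration, but the host names, credentials, table, and staging path are placeholders; check the spark-connector project for the exact options your connector version supports.

import org.apache.spark.sql.SparkSession

// Assumes the connector assembly is on the Spark classpath (see "Deploy the
// connector to the Spark cluster" below). In spark-shell, a SparkSession
// named spark already exists.
val spark = SparkSession.builder()
  .appName("vertica-read-sketch")
  .getOrCreate()

val salesDf = spark.read
  .format("com.vertica.spark.datasource.VerticaSource")
  .option("host", "vertica-host")       // placeholder Vertica node
  .option("port", "5433")
  .option("db", "testdb")               // placeholder database name
  .option("user", "dbadmin")
  .option("password", "")
  .option("table", "sales")             // placeholder source table
  .option("staging_fs_url", "hdfs://hdfs-host:8020/tmp/vertica-staging") // intermediate storage
  .load()

// The column selection and filter are pushed down to Vertica, so only the
// matching rows and columns are exported to the intermediate location.
val recentSales = salesDf.select("order_id", "amount").filter("amount > 100")
recentSales.show()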

See the spark-connector project on GitHub for more information about the connector.

How the connector works

The Spark connector is a library that you incorporate into your Spark applications to read data from and write data to Vertica. When transferring data, the connector uses an intermediate storage location as a buffer between the Vertica and Spark clusters. Using the intermediate storage location lets both Vertica and Spark use all of the nodes in their clusters to transfer data in parallel.

When transferring data from Vertica to Spark, the connector tells Vertica to write the data as Parquet files in the intermediate storage location. As part of this process, the connector pushes down the required columns and any Spark data filters into Vertica as SQL. This push down lets Vertica pre-filter the data so it only copies the data that Spark needs. Once Vertica finishes copying the data, the connector has Spark load it into DataFrames from the intermediate location.

When transferring data from Spark to Vertica, the process is reversed. Spark writes data into the intermediate storage location. It then connects to Vertica and runs a COPY statement to load the data from the intermediate location into a table.
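
A write in the other direction follows the same pattern. The following Scala sketch saves a DataFrame to a Vertica table; as above, the option names follow the connector project's documented configuration, and the table name, credentials, and staging URL are placeholders.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("vertica-write-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical results DataFrame; in practice this is the output of your Spark job.
val results = Seq((1, 120.5), (2, 98.0)).toDF("order_id", "amount")

results.write
  .format("com.vertica.spark.datasource.VerticaSource")
  .option("host", "vertica-host")
  .option("db", "testdb")
  .option("user", "dbadmin")
  .option("password", "")
  .option("table", "spark_results")     // placeholder target table
  .option("staging_fs_url", "hdfs://hdfs-host:8020/tmp/vertica-staging")
  .mode(SaveMode.Overwrite)             // Spark writes Parquet to the staging area, then Vertica runs COPY
  .save()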

Getting the connector

The Spark connector is an open source project. For the latest information on it, visit the spark-connector project on GitHub. To get an alert when there are updates to the connector, you can log into a GitHub account and click the Notifications button on any of the project's pages.

You have three options to get the Spark connector:

  • Get the connector from Maven Central. If you use Gradle, Maven, or SBT to manage your Spark applications, you can add the connector as a dependency to your project; your build tool then automatically downloads the connector and its dependencies and adds them to your Spark application. See Getting the Connector from Maven Central below.

  • Download a precompiled assembly from the GitHub project's releases page. Once you have downloaded the connector, you must configure Spark to use it. See Deploy the Connector to the Spark Cluster below.

  • Clone the connector project and compile it. This option is useful if you want features or bugfixes that have not been released yet. See Compiling the Connector below.

See the Spark connector project on GitHub for detailed instructions on deploying and using the connector.

Getting the connector from Maven Central

The Vertica Spark connector is available from the Maven Central Repository. Using Maven Central is the easiest method of getting the connector if your build tool supports downloading dependencies from it.

If your Spark project is managed by Gradle, Maven, or SBT, you can add the Spark connector to it by listing it as a dependency in its configuration file. You may also have to enable Maven Central if your build tool does not automatically do so.

For example, suppose you use SBT to manage your Spark application. In this case, the Maven Central repository is enabled by default. All you need to do is add the com.vertica.spark dependency to your build.sbt file.
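
A build.sbt dependency entry might look like the following sketch. The artifact coordinates and the version placeholder are illustrative only; confirm the exact values on the Maven Repository page referenced below.

// Illustrative only: verify the artifact name and version on Maven Central
// before using them in your build.
libraryDependencies += "com.vertica.spark" % "vertica-spark" % "<connector-version>"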

See the com.vertica.spark page on the Maven Repository site for more information on the available Spark connector versions and dependency information for your build system.

Compiling the connector

You may choose to compile the Spark connector if you want to test new features or bugfixes that have not yet been released. Compiling the connector is necessary if you plan on contributing your own features.

To compile the connector, you need:

  • The SBT build tool. In order to compile the connector, you must install SBT and all of its dependencies (including a version of the Java SDK). See the SBT documentation for requirements and installation instructions.

  • git to clone the Spark connector V2 source from GitHub.

As a quick overview, executing the following commands on a Linux command line will download the source and compile it into an assembly file:

$ git clone https://github.com/vertica/spark-connector.git
$ cd spark-connector/connector
$ sbt assembly

Once compiled, the connector is located at target/scala-n.n/spark-vertica-connector-assembly-x.x.jar, where n.n is the currently supported version of Scala and x.x is the current version of the Spark connector.

See the CONTRIBUTING document in the connector's GitHub project for detailed requirements and compiling instructions.

Once you compile the connector, you must deploy it to your Spark cluster. See the next section for details.

Deploy the connector to the Spark cluster

If you downloaded the Spark connector from GitHub or compiled it yourself, you must deploy it to your Spark cluster before you can use it. You have two options: copy it to a Spark node and include it in a spark-submit or spark-shell command, or deploy it to the entire cluster so that it is loaded automatically.

Loading the Spark connector from the command line

The quickest way to use the connector is to include it in the --jars argument when executing a spark-submit or spark-shell command. To be able to use the connector in the command line, you must first copy its assembly JAR to the Spark node on which you will run the commands. Then add the path to the assembly JAR file as part of the --jars command line argument.

For example, suppose you copied the assembly file to your current directory on a Spark node. Then you could load the connector when starting spark-shell with the command:

spark-shell --jars spark-vertica-connector-assembly-x.x.jar

You could also enable the connector when submitting a Spark job using spark-submit with the command:

spark-submit --jars spark-vertica-connector-assembly-x.x.jar
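
A complete spark-submit invocation also names your application; for example (the class and application JAR names here are placeholders):

spark-submit --jars spark-vertica-connector-assembly-x.x.jar --class com.example.MyVerticaApp my-vertica-app.jar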

Configure Spark to automatically load the connector

You can configure your Spark cluster to automatically load the connector. Deploying the connector this way ensures that the Spark connector is available to all Spark applications on your cluster.

To have Spark automatically load the Spark connector:

  1. Copy the Spark connector's assembly JAR file to the same path on all of the nodes in your Spark cluster. For example, on Linux you could copy the spark-vertica-connector-assembly-x.x.jar file to the /usr/local/lib directory on every Spark node. The file must be in the same location on each node.

  2. In the Spark installation directory on each node in your cluster, edit the conf/spark-defaults.conf file to add or alter the following line:

    spark.jars /path_to_assembly/spark-vertica-connector-assembly-x.x.jar
    

    For example, if you copied the assembly JAR file to /usr/local/lib, you would add:

    spark.jars /usr/local/lib/spark-vertica-connector-assembly-x.x.jar
    
  3. Test your configuration by starting a Spark shell and entering the statement:

    import com.vertica.spark._
    

    If the statement completes successfully, then Spark was able to locate and load the Spark connector library correctly.

Prior connector versions

The Vertica Spark connector is an open source project available from GitHub. It is released on a separate schedule from the Vertica server.

Versions of Vertica prior to 11.0.2 distributed a closed-source proprietary version of the Spark connector. This legacy version is no longer supported, and is not distributed with the Vertica server install package.

The newer open source Spark connector differs from the old one in the following ways:

  • The new connector uses the Spark V2 API. Using this newer API makes it more future-proof than the legacy connector, which uses an older Spark API.

  • The primary class name has changed. Also, the primary class has several renamed configuration options and a few removed options. See Migrating from the legacy Vertica Spark connector for a list of these changes.

  • It supports more features than the older connector, such as Kerberos authentication and S3 intermediate storage.

  • It comes compiled as an assembly which contains supporting libraries such as the Vertica JDBC library.

  • It is distributed separately from the Vertica server. You can download it directly from the GitHub project. It is also available from the Maven Central Repository, making it easier for you to integrate it into your Gradle, Maven, or SBT workflows.

  • It is an open-source project that is not tied to the Vertica server release cycle. New features and bug fixes do not have to wait for a Vertica release. You can also contribute your own features and fixes.

If you have Spark applications that used the previous version of the Spark connector, see Migrating from the legacy Vertica Spark connector for instructions on how to update your Spark code to work with the new connector.

If you are still using a version of Spark earlier than 3.0, you must use the legacy version of the Spark connector. The new open source version is incompatible with older versions of Spark. You can get the older connector by downloading and extracting a Vertica installation package earlier than version 11.0.2.