Apache Spark integration

Apache Spark is an open-source, general-purpose cluster-computing framework. See the Apache Spark website for more information. Vertica provides a Spark connector that you install into Spark to transfer data between Vertica and Spark.

Using the connector, you can:

  • Copy large volumes of data from Spark DataFrames to Vertica tables. This feature lets you save your Spark analytics in Vertica.

  • Copy data from Vertica to Spark RDDs or DataFrames for use with Python, R, Scala and Java. The connector efficiently pushes down column selection and predicate filtering to Vertica before loading the data.

The connector lets you use Spark to preprocess data for Vertica and to use Vertica data in your Spark application. You can even round-trip data from Vertica to Spark—copy data from Vertica to Spark for analysis, and then save the results of that analysis back to Vertica.

See the spark-connector project on GitHub for more information about the connector.

How the connector works

The Spark connector is a library that you incorporate into your Spark applications to read data from and write data to Vertica. When transferring data, the connector uses an intermediate storage location as a buffer between the Vertica and Spark clusters. Using the intermediate storage location lets both Vertica and Spark use all of the nodes in their clusters to transfer data in parallel.

When transferring data from Vertica to Spark, the connector tells Vertica to write the data as Parquet files in the intermediate storage location. As part of this process, the connector pushes down the required columns and any Spark data filters into Vertica as SQL. This push down lets Vertica pre-filter the data so it only copies the data that Spark needs. Once Vertica finishes copying the data, the connector has Spark load it into DataFrames from the intermediate location.

When transferring data from Spark to Vertica, the process is reversed. Spark writes data into the intermediate storage location. It then connects to Vertica and runs a COPY statement to load the data from the intermediate location into a table.
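
For example, a read and a write through the connector look roughly like the following Scala sketch. The staging_fs_url option names the intermediate storage location; the other option names (host, user, password, db, table) and all values shown are illustrative assumptions, so consult the connector's GitHub documentation for the authoritative option list.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VerticaRoundTrip").getOrCreate()

// Connection options. staging_fs_url is the intermediate storage location;
// the remaining option names and all values are placeholders.
val opts = Map(
  "host" -> "vertica.example.com",
  "user" -> "dbadmin",
  "password" -> "",
  "db" -> "VMart",
  "table" -> "sales",
  "staging_fs_url" -> "hdfs://hdfs.example.com:8020/tmp/vertica-staging"
)

// Read: Vertica writes Parquet files to the staging location, and Spark loads
// them into a DataFrame. The column selection and filter are pushed down to Vertica.
val df = spark.read
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(opts)
  .load()
  .select("customer_id", "amount")   // hypothetical columns
  .filter("amount > 100")

// Write: Spark writes Parquet files to the staging location, and Vertica runs a
// COPY statement to load them into the target table.
df.write
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(opts + ("table" -> "sales_summary"))
  .mode("append")
  .save()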

Getting the connector

The Spark connector is an open source project. For the latest information on it, visit the spark-connector project on GitHub. To get an alert when there are updates to the connector, you can log into a GitHub account and click the Notifications button on any of the project's pages.

You have three options to get the Spark connector:

  • Get the connector from Maven Central. If you use Gradle, Maven, or SBT to manage your Spark applications, you can add a dependency to your project to automatically get the connector and its dependencies and add them into your Spark application. See Getting the Connector from Maven Central below.

  • Download a precompiled assembly from the GitHub project's releases page. Once you have downloaded the connector, you must configure Spark to use it. See Deploy the Connector to the Spark Cluster below.

  • Clone the connector project and compile it. This option is useful if you want features or bugfixes that have not been released yet. See Compiling the Connector below.

See the Spark connector project on GitHub for detailed instructions on deploying and using the connector.

Getting the connector from Maven Central

The Vertica Spark connector is available from the Maven Central Repository. Using Maven Central is the easiest method of getting the connector if your build tool supports downloading dependencies from it.

If your Spark project is managed by Gradle, Maven, or SBT, you can add the Spark connector to it by listing it as a dependency in its configuration file. You may also have to enable Maven Central if your build tool does not do so automatically.

For example, suppose you use SBT to manage your Spark application. In this case, the Maven Central repository is enabled by default. All you need to do is add the com.vertica.spark dependency to your build.sbt file.
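
For example, assuming the artifact name and version placeholders below (check the com.vertica.spark listing on Maven Central for the exact coordinates), the build.sbt entry would look something like this:

libraryDependencies += "com.vertica.spark" % "vertica-spark" % "x.x"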

See the com.vertica.spark page on the Maven Repository site for more information on the available Spark connector versions and dependency information for your build system.

Compiling the connector

You may choose to compile the Spark connector if you want to test new features or bugfixes that have not yet been released. Compiling the connector is necessary if you plan on contributing your own features.

To compile the connector, you need:

  • The SBT build tool. In order to compile the connector, you must install SBT and all of its dependencies (including a version of the Java SDK). See the SBT documentation for requirements and installation instructions.

  • git, to clone the Spark connector V2 source from GitHub.

As a quick overview, executing the following commands on a Linux command line will download the source and compile it into an assembly file:

$ git clone https://github.com/vertica/spark-connector.git
$ cd spark-connector/connector
$ sbt assembly

Once compiled, the connector assembly is located at target/scala-n.n/spark-vertica-connector-assembly-x.x.jar, where n.n is the currently supported version of Scala and x.x is the current version of the Spark connector.

See the CONTRIBUTING document in the connector's GitHub project for detailed requirements and compiling instructions.

Once you compile the connector, you must deploy it to your Spark cluster. See the next section for details.

Deploy the connector to the Spark cluster

If you downloaded the Spark connector from GitHub or compiled it yourself, you must deploy it to your Spark cluster before you can use it. You have two options: copy it to a Spark node and include it in a spark-submit or spark-shell command, or deploy it to the entire cluster and have Spark load it automatically.

Loading the Spark connector from the command line

The quickest way to use the connector is to include it in the --jars argument when executing a spark-submit or spark-shell command. To be able to use the connector in the command line, you must first copy its assembly JAR to the Spark node on which you will run the commands. Then add the path to the assembly JAR file as part of the --jars command line argument.

For example, suppose you copied the assembly file to your current directory on a Spark node. Then you could load the connector when starting spark-shell with the command:

spark-shell --jars spark-vertica-connector-assembly-x.x.jar

You could also enable the connector when submitting a Spark job using spark-submit with the command:

spark-submit --jars spark-vertica-connector-assembly-x.x.jar
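
In practice, a spark-submit invocation also names the application's main class and JAR. The master URL, class name, and application JAR below are placeholders:

spark-submit --master spark://spark-master.example.com:7077 \
  --jars spark-vertica-connector-assembly-x.x.jar \
  --class com.example.MyVerticaJob \
  my-vertica-job.jar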

Configure Spark to automatically load the connector

You can configure your Spark cluster to automatically load the connector. Deploying the connector this way ensures that the Spark connector is available to all Spark applications on your cluster.

To have Spark automatically load the Spark connector:

  1. Copy the Spark connector's assembly JAR file to the same path on all of the nodes in your Spark cluster. For example, on Linux you could copy the spark-vertica-connector-assembly-x.x.jar file to the /usr/local/lib directory on every Spark node. The file must be in the same location on each node.

  2. In your Spark installation directories on each node in your cluster, edit the conf/spark-defaults.conf file to add or alter the following line:

    spark.jars /path_to_assembly/spark-vertica-connector-assembly-x.x.jar
    

    For example, if you copied the assembly JAR file to /usr/local/lib, you would add:

    spark.jars /usr/local/lib/spark-vertica-connector-assembly-x.x.jar
    
  3. Test your configuration by starting a Spark shell and entering the statement:

    import com.vertica.spark._
    

    If the statement completes successfully, then Spark was able to locate and load the Spark connector library correctly.

Prior connector versions

The Vertica Spark connector is an open source project available from GitHub. It is released on a separate schedule from the Vertica server.

Versions of Vertica prior to 11.0.2 distributed a closed-source proprietary version of the Spark connector. This legacy version is no longer supported, and is not distributed with the Vertica server install package.

The newer open source Spark connector differs from the old one in the following ways:

  • The new connector uses the Spark V2 API. Using this newer API makes it more future-proof than the legacy connector, which uses an older Spark API.

  • The primary class name has changed. Also, the primary class has several renamed configuration options and a few removed options. See Migrating from the legacy Vertica Spark connector for a list of these changes.

  • It supports more features than the older connector, such as Kerberos authentication and S3 intermediate storage.

  • It comes compiled as an assembly which contains supporting libraries such as the Vertica JDBC library.

  • It is distributed separately from the Vertica server. You can download it directly from the GitHub project. It is also available from the Maven Central Repository, making it easier for you to integrate it into your Gradle, Maven, or SBT workflows.

  • It is an open-source project that is not tied to the Vertica server release cycle. New features and bug fixes do not have to wait for a Vertica release. You can also contribute your own features and fixes.

If you have Spark applications that used the previous version of the Spark connector, see Migrating from the legacy Vertica Spark connector for instructions on how to update your Spark code to work with the new connector.

If you are still using a version of Spark earlier than 3.0, you must use the legacy version of the Spark connector. The new open source version is incompatible with older versions of Spark. You can get the older connector by downloading and extracting a Vertica installation package earlier than version 11.0.2.

Migrating from the legacy Vertica Spark connector

If you have existing Spark applications that used the closed-source Vertica Spark connector, you must update them to work with the newer open-source connector. The new connector offers new features, better performance, and is under ongoing development. The old connector has been removed from service.

Deployment changes between the old and new connectors

The legacy connector was only distributed with the Vertica server install. The new connector is distributed through several channels, giving you more ways to deploy it.

The new Spark connector is available from Maven Central. If you use Gradle, Maven, or SBT to manage your Spark applications, you may find it more convenient to deploy the Spark connector using a dependency rather than manually installing it on your Spark cluster. Integrating the connector into your Spark project as a dependency makes updating to newer versions of the connector easy—just update the required version in the dependency. See Getting the Connector from Maven Central for more information.

You can also download the precompiled connector assembly or build it from source. In this case, you must deploy the connector to your Spark cluster. The legacy connector depended on the Vertica JDBC driver and required that you separately include it. The new connector is an assembly that incorporates all of its dependencies, including the JDBC driver. You only need to deploy a single JAR file containing the Spark connector to your Spark cluster.

You can have Spark load both the legacy and new connectors at the same time because the new connector's primary class name is different (see below). This renaming lets you add the new connector to your Spark configuration files without having to immediately port all of your Spark applications that use the legacy connector to the new API. You can just add the new assembly JAR file to the spark.jars list in the spark-defaults.conf file.
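
For example, because spark.jars accepts a comma-separated list, a spark-defaults.conf entry along these lines can load both connectors side by side (the legacy JAR file name is a placeholder):

spark.jars /usr/local/lib/spark-vertica-connector-assembly-x.x.jar,/usr/local/lib/legacy-vertica-spark-connector.jar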

API changes

Several API changes from the legacy connector to the new connector require changes to your Spark application.

VerticaRDD class no longer supported

The legacy connector supported a class named VerticaRDD to load data from Vertica using the Spark resilient distributed dataset (RDD) feature. The new connector does not support this separate class. Instead, if you want to directly manipulate an RDD, access it through the DataFrame object you create using the DataSource API.
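
For example, assuming connection options like those sketched earlier in this section, you can reach the RDD through the standard DataFrame API:

// Read through the connector, then take the DataFrame's RDD view (an RDD[Row]).
val df = spark.read
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(opts)   // connection options as sketched earlier
  .load()
val rowRdd = df.rdd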

DefaultSource class renamed VerticaSource

The primary class in the legacy connector is named DefaultSource. In the new connector, this class has been renamed VerticaSource. This renaming lets both connectors coexist, allowing you to gradually transition your Spark applications.

For your existing Spark application to use the new connector, you must change calls to the DefaultSource class to calls to the VerticaSource class. For example, suppose your Spark application has this method call to read data from the legacy connector:

spark.read.format("com.vertica.spark.datasource.DefaultSource").options(opts).load()

Then to have it use the new connector, use this method call:

spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opts).load()

Changed API options

In addition to the renaming of the DefaultSource class to VerticaSource, some of the options for the primary connector class have been renamed, and others are no longer supported. If you are porting a Spark application from the legacy connector to the new one and it uses any of the following options, you must update your code:

  • fileformat: no equivalent. The new connector does not support the fileformat option. The files that Vertica and Spark write to the intermediate storage location are always in Parquet format.

  • hdfs_url: renamed to staging_fs_url. Sets the intermediate storage location that Vertica and Spark use to exchange data. Renamed to be more general, as the new connector supports storage platforms in addition to HDFS.

  • logging_level: no equivalent. The connector no longer supports setting a logging level. Instead, set the logging level in Spark.

  • numpartitions: renamed to num_partitions. Sets the number of Spark partitions to use when reading data from Vertica.

  • target_table_ddl: renamed to target_table_sql. Sets a SQL statement for Vertica to execute before loading data from Spark.

  • web_hdfs_url: replaced by staging_fs_url. Use the same option for a web HDFS URL as you would for an S3 or HDFS storage location.
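
For example, a legacy options map might translate to the new connector as follows; the staging path, partition count, and SQL statement are placeholders:

// Legacy DefaultSource options
val legacyOpts = Map(
  "hdfs_url"         -> "hdfs://hdfs.example.com:8020/tmp/staging",
  "numpartitions"    -> "16",
  "target_table_ddl" -> "CREATE TABLE IF NOT EXISTS sales_summary(customer_id INT, amount FLOAT)"
)

// Equivalent options for the new VerticaSource class
val newOpts = Map(
  "staging_fs_url"   -> "hdfs://hdfs.example.com:8020/tmp/staging",
  "num_partitions"   -> "16",
  "target_table_sql" -> "CREATE TABLE IF NOT EXISTS sales_summary(customer_id INT, amount FLOAT)"
)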

In addition, the new connector has added options to support new features such as Kerberos authentication. For details on the connector's VerticaSource options API, see the Vertica Spark connector GitHub project.

Take advantage of new features

The new Vertica Spark connector offers new features that you may want to take advantage of.

Currently, the most notable new features are:

  • Kerberos authentication. This feature lets you configure the connector for passwordless connections to Vertica. See the Kerberos documentation in the Vertica Spark connector GitHub project for details of using this feature.

  • Support for using S3 for intermediate storage. This option lets you avoid having to set up a Hadoop cluster solely to host an HDFS storage location. See the S3 user guide at the connector GitHub site.
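
For instance, switching the intermediate storage location to S3 is largely a matter of changing staging_fs_url. The bucket name and path below are placeholders, and credentials are typically supplied through the Hadoop S3A configuration or additional connector options; follow the connector's S3 user guide for the authoritative setup.

// Use an S3 path as the intermediate storage location.
val s3Opts = Map(
  "host" -> "vertica.example.com",   // placeholder values
  "user" -> "dbadmin",
  "db" -> "VMart",
  "table" -> "sales",
  "staging_fs_url" -> "s3a://my-staging-bucket/vertica-spark"
)
val df = spark.read
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(s3Opts)
  .load()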