Migrating from the legacy Vertica Spark connector
If you have existing Spark applications that used the closed-source Vertica Spark connector, you must update them to work with the newer open-source connector. The new connector offers new features, better performance, and is under ongoing development. The old connector has been removed from service.
Deployment changes between the old and new connectors
The legacy connector was distributed only with the Vertica server install. The new connector is distributed through several channels, giving you more ways to deploy it.
The new Spark connector is available from Maven Central. If you use Gradle, Maven, or SBT to manage your Spark applications, you may find it more convenient to deploy the Spark connector using a dependency rather than manually installing it on your Spark cluster. Integrating the connector into your Spark project as a dependency makes updating to newer versions of the connector easy—just update the required version in the dependency. See Getting the Connector from Maven Central for more information.
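For example, if you manage your project with SBT, the dependency might be declared as shown below. This is a sketch only: confirm the exact coordinates (group ID, artifact ID, and whether a Scala version suffix is required) and the current version on Maven Central before using it.
libraryDependencies += "com.vertica.spark" % "vertica-spark" % "3.3.5" // replace with the current version from Maven Central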
You can also download the precompiled connector assembly or build it from source. In this case, you must deploy the connector to your Spark cluster. The legacy connector depended on the Vertica JDBC driver and required that you separately include it. The new connector is an assembly that incorporates all of its dependencies, including the JDBC driver. You only need to deploy a single JAR file containing the Spark connector to your Spark cluster.
You can have Spark load both the legacy and new connector at the same time because the new connector's primary class name is different (see below). This renaming lets you add the new connector to your Spark configuration files without having to immediately port all of your Spark applications that use the legacy connector to the new API. You can just add the new assembly JAR file to the spark.jars list in the spark-defaults.conf file.
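For example, a minimal spark-defaults.conf entry might look like the following. The path and file name are illustrative and depend on where you place the assembly JAR on your Spark cluster.
spark.jars /opt/spark/jars/vertica-spark-assembly.jar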
API changes
There are several API changes from the legacy connector to the new connector that require changes to your Spark application.
VerticaRDD class no longer supported
The legacy connector supported a class named VerticaRDD to load data from Vertica using the Spark resilient distributed dataset (RDD) feature. The new connector does not support this separate class. Instead, if you want to manipulate an RDD directly, access it through the DataFrame object that you create using the DataSource API.
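For example, assuming opts is a map of VerticaSource connection options (host, db, user, password, table, and so on), a minimal sketch of reaching the RDD looks like this:
val df = spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opts).load()
val rdd = df.rdd // the underlying RDD[Row] is exposed through the DataFrame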
DefaultSource class renamed VerticaSource
The primary class in the legacy connector is named DefaultSource. In the new connector, this class has been renamed to VerticaSource. This renaming lets both connectors coexist, allowing you to gradually transition your Spark applications.
For your existing Spark applications to use the new connector, you must change calls to the DefaultSource class to the VerticaSource class. For example, suppose your Spark application has this method call to read data using the legacy connector:
spark.read.format("com.vertica.spark.datasource.DefaultSource").options(opts).load()
To have it use the new connector instead, change the call to:
spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opts).load()
Changed API options
In addition to the renaming of the DefaultSource class to VerticaSource, some of the options for the primary connector class have been renamed. Other options are no longer supported. If you are porting a Spark application that uses any of the following options from the legacy connector to the new connector, you must update your code:
| Legacy DataSource Option | New VerticaSource Option | Description |
|---|---|---|
| fileformat | none | The new connector does not support the fileformat option. The files that Vertica and Spark write to the intermediate storage location are always in Parquet format. |
| hdfs_url | staging_fs_url | The location of the intermediate storage that Vertica and Spark use to exchange data. Renamed to be more general, as the new connector supports storage platforms in addition to HDFS. |
| logging_level | none | The connector no longer supports setting a logging level. Instead, set the logging level in Spark. |
| numpartitions | num_partitions | The number of Spark partitions to use when reading data from Vertica. |
| target_table_ddl | target_table_sql | A SQL statement for Vertica to execute before loading data from Spark. |
| web_hdfs_url | staging_fs_url | Use the same option for a web HDFS URL as you would for an S3 or HDFS storage location. |
In addition, the new connector has added options to support new features such as Kerberos authentication. For details on the connector's VerticaSource options API, see the Vertica Spark connector GitHub project.
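For example, porting an options map from the legacy connector might look like the following sketch. The option values are placeholders, and the options not listed in the table above (host, db, user, password, and table) are assumed names; see the GitHub project for the authoritative option list.
val opts = Map(
  "host" -> "vertica.example.com",
  "db" -> "testdb",
  "user" -> "dbadmin",
  "password" -> "xxxx",
  "table" -> "mytable",
  "staging_fs_url" -> "hdfs://hadoop.example.com:8020/tmp/vertica-spark", // formerly hdfs_url or web_hdfs_url
  "num_partitions" -> "16" // formerly numpartitions
)
val df = spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opts).load()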
Take advantage of new features
The new Vertica Spark connector offers several features that you may want to take advantage of. Currently, the most notable are:
- Kerberos authentication. This feature lets you configure the connector for passwordless connections to Vertica. See the Kerberos documentation in the Vertica Spark connector GitHub project for details on using this feature.
- Support for using S3 for intermediate storage. This option lets you avoid having to set up a Hadoop cluster solely to host an HDFS storage location. See the S3 user guide at the connector GitHub site, and the sketch following this list.
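For example, the sketch below points the intermediate storage location at an S3 bucket instead of HDFS, reusing the opts map from the earlier example. The bucket path is illustrative, and S3 credentials are assumed to be supplied separately (for example, through your Hadoop s3a configuration) as described in the connector's S3 user guide.
val s3Opts = opts + ("staging_fs_url" -> "s3a://mybucket/vertica-spark-staging") // S3 path instead of an HDFS URL
val df = spark.read.format("com.vertica.spark.datasource.VerticaSource").options(s3Opts).load()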