File systems and object stores
Vertica supports access to several file systems and object stores in addition to the Linux file system. The reference pages in this section provide information on URI syntax, configuration parameters, and authentication.
Vertica accesses the file systems in this section in one of two ways:
- If user-provided credentials are present, Vertica uses them to access the storage. Note that on HDFS, user credentials are always present because Vertica accesses HDFS using the Vertica user identity.
- If user-provided credentials are not present, or if the UseServerIdentityOverUserIdentity configuration parameter is set, Vertica checks for a configured USER storage location. When access is managed through USER storage locations, Vertica uses the server credential to access the file system. For more information about USER storage locations, see CREATE LOCATION; a brief sketch follows this list.
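As a sketch of the USER storage location approach (the bucket name, label, and role below are illustrative, and a server credential such as AWSAuth is assumed to already be set at the database level), a superuser might define a shared location backed by the server credential and grant read access to a role:
=> CREATE LOCATION 's3://examplebucket/data' SHARED USAGE 'USER' LABEL 'example_loc';
=> CREATE ROLE readers;
=> GRANT READ ON LOCATION 's3://examplebucket/data' TO readers;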
Not all file systems are supported in all contexts. See the documentation of specific features for the file systems those features support.
1 - Azure Blob Storage object store
Azure has several interfaces for accessing data. Vertica reads and always writes Block Blobs in Azure Storage. Vertica can read external data created using ADLS Gen2, and data that Vertica exports can be read using ADLS Gen2.
One of the following:
azb://account/container/path
azb://[account@]host[:port]/container/path
In the first form, a URI like 'azb://myaccount/mycontainer/path' treats the first token after the '//' as the account name. In the second form, you must specify the host explicitly and can optionally include the account.
The following rules apply to the second form:
- If account is not specified, the first label of host is used. For example, if the URI is 'azb://myaccount.blob.core.windows.net/mycontainer/my/object', then 'myaccount' is used for account.
- If account is not specified and host has a single label and no port, the endpoint is host.blob.core.windows.net. Otherwise, the endpoint is the host and port specified in the URI.
The protocol (HTTP or HTTPS) is specified in the AzureStorageEndpointConfig configuration parameter.
Authentication
If you are using Azure managed identities, no further configuration in Vertica is needed. If your Azure storage uses multiple managed identities, you must tag the one to be used. Vertica looks for an Azure tag with a key of VerticaManagedIdentityClientId, the value of which must be the client_id attribute of the managed identity to be used. If you update the Azure tag, call AZURE_TOKEN_CACHE_CLEAR.
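For example, after updating the tag you might clear the cached token as follows (a minimal sketch; run it as a user with the required privileges):
=> SELECT AZURE_TOKEN_CACHE_CLEAR();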
If you are not using managed identities, use the AzureStorageCredentials configuration parameter to provide credentials to Azure. If loading data, you can set the parameter at the session level. If using Eon Mode communal storage on Azure, you must set this configuration parameter at the database level.
In Azure you must also grant access to the containers for the identities used from Vertica.
Configuration parameters
The following database configuration parameters apply to the Azure blob file system. You can set parameters at different levels with the appropriate ALTER statement, such as ALTER SESSION...SET PARAMETER. Query the CONFIGURATION_PARAMETERS system table to determine what levels (node, session, user, database) are valid for a given parameter.
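For instance, a query along these lines (the filter shown is illustrative) lists the Azure-related parameters and the levels at which each can be set:
=> SELECT parameter_name, allowed_levels FROM configuration_parameters
   WHERE parameter_name ILIKE 'Azure%';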
- AzureStorageCredentials
- Collection of JSON objects, each of which specifies connection credentials for one endpoint. This parameter takes precedence over Azure managed identities.
The collection must contain at least one object and may contain more. Each object must specify at least one of accountName or blobEndpoint, and at least one of accountKey or sharedAccessSignature.
accountName: If not specified, uses the label of blobEndpoint.
blobEndpoint: Host name with optional port (host:port). If not specified, uses account.blob.core.windows.net.
accountKey: Access key for the account or endpoint.
sharedAccessSignature: Access token for finer-grained access control, if being used by the Azure endpoint.
- AzureStorageEndpointConfig
- Collection of JSON objects, each of which specifies configuration elements for one endpoint. Each object must specify at least one of accountName or blobEndpoint.
accountName: If not specified, uses the label of blobEndpoint.
blobEndpoint: Host name with optional port (host:port). If not specified, uses account.blob.core.windows.net.
protocol: HTTPS (default) or HTTP.
isMultiAccountEndpoint: true if the endpoint supports multiple accounts, false otherwise (default is false). To use multiple-account access, you must include the account name in the URI. If a URI path contains an account, this value is assumed to be true unless explicitly set to false.
Examples
The following examples use these values for the configuration parameters. AzureStorageCredentials contains sensitive information and is set at the session level in this example.
=> ALTER SESSION SET AzureStorageCredentials =
'[{"accountName": "myaccount", "accountKey": "REAL_KEY"},
{"accountName": "myaccount", "blobEndpoint": "localhost:8080", "accountKey": "TEST_KEY"}]';
=> ALTER DATABASE default SET AzureStorageEndpointConfig =
'[{"accountName": "myaccount", "blobEndpoint": "localhost:8080", "protocol": "http"}]';
The following example creates an external table using data from Azure. The URI specifies an account name of "myaccount".
=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20))
AS COPY FROM 'azb://myaccount/mycontainer/my/object/*';
Vertica uses AzureStorageEndpointConfig and the account name to produce the following location for the files:
https://myaccount.blob.core.windows.net/mycontainer/my/object/*
Data is accessed using the REAL_KEY credential.
If the URI in the COPY statement is instead azb://myaccount.blob.core.windows.net/mycontainer/my/object, then the resulting location is https://myaccount.blob.core.windows.net/mycontainer/my/object, again using the REAL_KEY credential.
However, if the URI in the COPY statement is azb://myaccount@localhost:8080/mycontainer/my/object, then the host and port specify a different endpoint: http://localhost:8080/myaccount/mycontainer/my/object. This endpoint is configured to use a different credential, TEST_KEY.
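To illustrate, a load through that locally configured test endpoint might look like the following (a sketch; the table t is hypothetical, and the path reuses the values from the examples above):
=> COPY t FROM 'azb://myaccount@localhost:8080/mycontainer/my/object/*';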
2 - Google Cloud Storage (GCS) object store
File system using the Google Cloud Storage platform.
gs://bucket/path
Authentication
To access data in Google Cloud Storage (GCS) you must first do the following tasks:
- Create a default project, obtain a developer key, and enable S3 interoperability mode as described in the GCS documentation.
- Set the GCSAuth configuration parameter as in the following example.
=> ALTER SESSION SET GCSAuth='id:secret';
Configuration parameters
The following database configuration parameters apply to the GCS file system. You can set parameters at different levels with the appropriate ALTER statement, such as ALTER SESSION...SET PARAMETER. Query the CONFIGURATION_PARAMETERS system table to determine what levels (node, session, user, database) are valid for a given parameter. For information about all parameters related to GCS, see Google Cloud Storage parameters.
- GCSAuth
- An ID and secret key to authenticate to GCS. For extra security, do not store credentials in the database; instead, use ALTER SESSION...SET PARAMETER to set this value for the current session only.
- GCSEnableHttps
- Boolean, whether to use the HTTPS protocol when connecting to GCS. This parameter can be set only at the database level with ALTER DATABASE...SET PARAMETER.
Default: 1 (enabled)
- GCSEndpoint
- The connection endpoint address.
Default: storage.googleapis.com
Examples
The following example loads data from GCS:
=> ALTER SESSION SET GCSAuth='my_id:my_secret_key';
=> COPY t FROM 'gs://DataLake/clicks.parquet' PARQUET;
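Vertica can also write to GCS. For example, an export might look like the following (a sketch only; the target path is illustrative and assumes your Vertica version supports exporting to object stores):
=> EXPORT TO PARQUET(directory = 'gs://DataLake/sales') AS SELECT * FROM t;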
3 - HDFS file system
HDFS is the Hadoop Distributed File System. You can use the webhdfs and swebhdfs schemes to access data through the WebHDFS service. Vertica also supports the hdfs scheme, which by default uses the deprecated LibHDFS++ package. To treat hdfs URIs as if they were webhdfs URIs, set the HDFSUseWebHDFS configuration parameter to 1 (enabled).
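For example, to route all hdfs URIs through WebHDFS, you might set the parameter at the database level (a minimal sketch; choose the level appropriate for your deployment):
=> ALTER DATABASE DEFAULT SET HDFSUseWebHDFS = 1;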
If you specify a webhdfs URI but the Hadoop HTTP policy (dfs.http.policy) is set to HTTPS_ONLY, Vertica automatically uses swebhdfs instead.
If you use LibHDFS++, the WebHDFS service must still be available because Vertica falls back to WebHDFS for operations not supported by LibHDFS++.
Deprecated
Support for LibHDFS++ is deprecated. In the future, HDFSUseWebHDFS will be enabled in all cases and hdfs URIs will be equivalent to webhdfs URIs.
URIs in the webhdfs, swebhdfs, and hdfs schemes all have two formats, depending on whether you specify a name service or the host and port of a name node:
[[s]web]hdfs://[nameservice]/path
[[s]web]hdfs://namenode-host:port/path
Characters may be URL-encoded (%NN where NN is a two-digit hexadecimal number) but are not required to be, except that the '%' character must be encoded.
To use the default name service specified in the HDFS configuration files, omit nameservice. Use this shorthand only for reading external data, not for creating a storage location.
Always specify a name service or host explicitly when using Vertica with more than one HDFS cluster. The name service or host name must be globally unique. Using [web]hdfs:/// could produce unexpected results because Vertica uses the first value of fs.defaultFS that it finds.
Authentication
Vertica can use Kerberos authentication with Cloudera or Hortonworks HDFS clusters. See Accessing kerberized HDFS data.
For loading and exporting data, Vertica can access HDFS clusters protected by mTLS through the swebhdfs scheme. You must create a certificate and key and set the WebhdfsClientCertConf configuration parameter.
You can use CREATE KEY and CREATE CERTIFICATE to create temporary, session-scoped values if you specify the TEMPORARY keyword. Temporary keys and certificates are stored in memory, not on disk.
The WebhdfsClientCertConf configuration parameter holds client credentials for one or more HDFS clusters. The value is a JSON string listing name services or authorities and their corresponding keys. You can set the configuration parameter at the session or database level; setting it at the database level has additional requirements.
The following example shows how to use mTLS. The key and certificate values themselves are not shown, just the beginning and end markers:
=> CREATE TEMPORARY KEY client_key TYPE 'RSA'
AS '-----BEGIN PRIVATE KEY-----...-----END PRIVATE KEY-----';
=> CREATE TEMPORARY CERTIFICATE client_cert
AS '-----BEGIN CERTIFICATE-----...-----END CERTIFICATE-----' key client_key;
=> ALTER SESSION SET WebhdfsClientCertConf =
'[{"authority": "my.hdfs.namenode1:50088", "certName": "client_cert"}]';
=> COPY people FROM 'swebhdfs://my.hdfs.namenode1:50088/path/to/file/1.txt';
Rows Loaded
-------------
1
(1 row)
To configure access to more than one HDFS cluster, define the keys and certificates and then include one object per cluster in the value of WebhdfsClientCertConf:
=> ALTER SESSION SET WebhdfsClientCertConf =
'[{"authority" : "my.authority.com:50070", "certName" : "myCert"},
{"nameservice" : "prod", "certName" : "prodCert"}]';
Configuration parameters
The following database configuration parameters apply to the HDFS file system. You can set parameters at different levels with the appropriate ALTER statement, such as ALTER SESSION...SET PARAMETER. Query the CONFIGURATION_PARAMETERS system table to determine what levels (node, session, user, database) are valid for a given parameter. For information about all parameters related to Hadoop, see Apache Hadoop parameters.
- EnableHDFSBlockInfoCache
- Boolean, whether to distribute block location metadata collected during planning on the initiator to all database nodes for execution, reducing name node contention. Disabled by default.
- HadoopConfDir
- Directory path containing the XML configuration files copied from Hadoop. The same path must be valid on every Vertica node. The files are accessed by the Linux user under which the Vertica server process runs.
- HadoopImpersonationConfig
- Session parameter specifying the delegation token or Hadoop user for HDFS access. See HadoopImpersonationConfig format for information about the value of this parameter and Proxy users and delegation tokens for more general context.
- HDFSUseWebHDFS
- Boolean. If true, URIs in the hdfs scheme are treated as if they were in the webhdfs scheme. If false, Vertica uses LibHDFS++ where possible, though some operations can still use WebHDFS if not supported by LibHDFS++.
- WebhdfsClientCertConf
- mTLS configurations for accessing one or more WebHDFS servers, a JSON string. Each object must specify either a nameservice or authority field, and a certName field. See Authentication.
Configuration files
The path specified in HadoopConfDir must include a directory containing the files listed in the following table. Vertica reads these files at database start time. If you do not set a value, Vertica looks for the files in /etc/hadoop/conf.
If a property is not defined, Vertica uses the defaults shown in the table. If no default is specified for a property, the configuration files must specify a value.
File | Properties | Default
core-site.xml | fs.defaultFS | none
 | (for doAs users:) hadoop.proxyuser.username.users | none
 | (for doAs users:) hadoop.proxyuser.username.hosts | none
hdfs-site.xml | dfs.client.failover.max.attempts | 15
 | dfs.client.failover.sleep.base.millis | 500
 | dfs.client.failover.sleep.max.millis | 15000
 | (For HA NN:) dfs.nameservices | none
 | (WebHDFS:) dfs.namenode.http-address or dfs.namenode.https-address | none
 | (WebHDFS:) dfs.datanode.http.address or dfs.datanode.https.address | none
 | (WebHDFS:) dfs.http.policy | HTTP_ONLY
If using High Availability (HA) Name Nodes, the individual name nodes must also be defined in hdfs-site.xml.
Note
If you are using Eon Mode with communal storage on HDFS and you set dfs.encrypt.data.transfer, you must use the swebhdfs scheme for communal storage.
To verify that Vertica can find configuration files in HadoopConfDir, use the VERIFY_HADOOP_CONF_DIR function.
To test access through the hdfs scheme, use the HDFS_CLUSTER_CONFIG_CHECK function.
For more information about testing your configuration, see Verifying HDFS configuration.
To reread the configuration files, use the CLEAR_HDFS_CACHES function.
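As a quick sketch, these checks can be run from vsql as meta-function calls:
=> SELECT VERIFY_HADOOP_CONF_DIR();
=> SELECT HDFS_CLUSTER_CONFIG_CHECK();
=> SELECT CLEAR_HDFS_CACHES();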
Name nodes and name services
You can access HDFS data using the default name node by not specifying a name node or name service:
=> COPY users FROM 'webhdfs:///data/users.csv';
Vertica uses the fs.defaultFS Hadoop configuration parameter to find the name node. (It then uses that name node to locate the data.) You can instead specify a host and port explicitly using the following format:
webhdfs://nn-host:nn-port/
The specified host is the name node, not an individual data node. If you are using High Availability (HA) Name Nodes you should not use an explicit host because high availability is provided through name services instead.
If the HDFS cluster uses High Availability Name Nodes or defines name services, use the name service instead of the host and port, in the format webhdfs://nameservice/. The name service you specify must be defined in hdfs-site.xml.
The following example shows how you can use a name service, hadoopNS:
=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20))
AS COPY FROM 'webhdfs://hadoopNS/data/users.csv';
If you are using Vertica to access data from more than one HDFS cluster, always use explicit name services or hosts in the URL. Using the /// shorthand could produce unexpected results because Vertica uses the first value of fs.defaultFS that it finds. To access multiple HDFS clusters, you must use host and service names that are globally unique. See Configuring HDFS access for more information.
4 - S3 object store
File systems using the S3 protocol, including AWS, Pure Storage, and MinIO.
s3://bucket/path
For AWS, specify the region in the AWSRegion configuration parameter, not the URI. If the region is incorrect, you might experience a delay before the load fails because Vertica retries several times before giving up. The default region is us-east-1.
Authentication
For AWS:
- To access S3 you must create an IAM role and grant that role permission to access your S3 resources.
- By default, bucket access is restricted to the communal storage bucket. Use an AWS access key to load data from non-communal storage buckets.
- Either set the AWSAuth configuration parameter to provide credentials or create a USER storage location for the S3 path (see CREATE LOCATION) and grant users access.
- You can use AWS STS temporary session tokens to load data. Because they are session tokens, do not use them for access to storage locations.
- You can configure S3 buckets individually with the per-bucket parameters S3BucketConfig and S3BucketCredentials. For details, see Per-bucket S3 configurations.
Configuration parameters
The following database configuration parameters apply to the S3 file system. You can set parameters at different levels with the appropriate ALTER statement, such as ALTER SESSION...SET PARAMETER. Query the CONFIGURATION_PARAMETERS system table to determine what levels (node, session, user, database) are valid for a given parameter. For information about all parameters related to S3, see S3 parameters.
- AWSAuth
- An ID and secret key for authentication. AWS calls these AccessKeyID and SecretAccessKey. For extra security, do not store credentials in the database; use ALTER SESSION...SET PARAMETER to set this value for the current session only.
To use admintools create_db or revive_db for Eon Mode on-premises, set this parameter in the auth_params.conf configuration file.
- AWSCAFile
- The file name of the TLS server certificate bundle to use. You must set a value when installing a CA certificate on a SUSE Linux Enterprise Server.
- AWSCAPath
- The path Vertica uses to look up TLS server certificates. You must set a value when installing a CA certificate on a SUSE Linux Enterprise Server.
- AWSEnableHttps
- Boolean, whether to use the HTTPS protocol when connecting to S3. Can be set only at the database level with ALTER DATABASE...SET PARAMETER.
Default: 1 (enabled)
- AWSEndpoint
- String, the endpoint host for all S3 URLs, set as follows:
- AWS: hostname_or_IP:port. Do not include the scheme (http(s)).
- AWS with a FIPS-compliant S3 Endpoint: Hostname of a FIPS-compliant S3 endpoint. You must also enable S3EnableVirtualAddressing.
- On-premises/Pure: IP address of the Pure Storage server.
If not set, Vertica uses virtual-hosted request URLs.
To use admintools create_db or revive_db for Eon Mode on-premises, set this parameter in the auth_params.conf configuration file.
Default: 's3.amazonaws.com'
- AWSLogLevel
- The log level, one of the following:
- OFF
- FATAL
- ERROR
- WARN
- INFO
- DEBUG
- TRACE
Default: ERROR
- AWSRegion
- The AWS region containing the S3 bucket from which to read files. This parameter can only be configured with one region at a time. If you need to access buckets in multiple regions, change the parameter each time you change regions.
Failure to set the correct region can lead to a delay before queries fail.
Default: 'us-east-1'
- AWSSessionToken
- A temporary security token generated by running the get-session-token command, used to configure multi-factor authentication.
Note
If you use session tokens, you must set all parameters at the session level, even if some of them are set at the database level. Use ALTER SESSION to set session parameters.
- S3BucketConfig
- Contains S3 bucket configuration information as a JSON object with the following properties. Some properties have equivalent database-level parameters (shown in parentheses). If both the database-level parameter and its equivalent in S3BucketConfig are set, the value in S3BucketConfig takes precedence.
Properties:
- bucket: The name of the bucket
- region: The name of the region (AWSRegion)
- protocol: Scheme, one of http or https
- endpoint: The endpoint URL or IP address (AWSEndpoint)
- enableVirtualAddressing: Boolean, whether to rewrite the S3 URL to use a virtual hosted path (S3EnableVirtualAddressing)
- requesterPays: Boolean, whether the requester (instead of the bucket owner) pays the cost of accessing data on the bucket (S3RequesterPays); must be set in order to access S3 buckets configured as Requester Pays buckets. By setting this property to true, you are accepting the charges for accessing data. If not specified, the default value is false.
- S3BucketCredentials
- Credentials for accessing an S3 bucket, a JSON object. Each property has an equivalent database-level parameter (shown in parentheses). If both the database-level parameter and its equivalent in S3BucketCredentials are set, the value in S3BucketCredentials takes precedence.
Properties:
- bucket: The name of the bucket
- accessKey: The access key for the bucket (the ID in AWSAuth)
- secretAccessKey: The secret access key for the bucket (the secret in AWSAuth)
- sessionToken: The session token, only used when S3BucketCredentials is set at the session level (AWSSessionToken)
This parameter is only visible to superusers. Users can set this parameter at the session level with ALTER SESSION.
- S3EnableVirtualAddressing
- Boolean, whether to rewrite S3 URLs to use virtual-hosted paths (disabled by default). This configuration setting takes effect only when you have specified a value for AWSEndpoint.
If you set AWSEndpoint to a FIPS-compliant S3 endpoint, you must enable S3EnableVirtualAddressing.
The value of this parameter does not affect how you specify S3 paths.
Note
As of September 30, 2020, AWS requires virtual address paths for newly created buckets.
- S3RequesterPays
- Boolean, whether requester (instead of bucket owner) pays the cost of accessing data on the bucket. When true, the bucket owner is only responsible for paying the cost of storing the data, rather than all costs associated with the bucket; must be set in order to access S3 buckets configured as Requester Pays buckets. If not specified, the default value is false.
- AWSStreamingConnectionPercentage
- In Eon Mode, the number of connections to the communal storage to use for streaming reads. In a cloud environment, this setting helps prevent streaming data from using up all available file handles. This setting is unnecessary when using on-premises object stores because of their lower latency.
Examples
The following example sets a database-wide AWS region and credentials:
=> ALTER DATABASE DEFAULT SET AWSRegion='us-west-1';
=> ALTER DATABASE DEFAULT SET AWSAuth = 'myaccesskeyid123456:mysecretaccesskey123456789012345678901234';
The following example loads data from S3. You can use a glob if all files in the glob can be loaded together. In the following example, the datalake bucket contains only ORC files.
=> COPY t FROM 's3://datalake/*' ORC;
You can specify a list of comma-separated S3 buckets as in the following example. All buckets must be in the same region. To load from more than one region, use separate COPY statements and change the value of AWSRegion between calls.
=> COPY t FROM 's3://AWS_Data_1/sales.parquet', 's3://AWS_Data_2/sales.parquet' PARQUET;
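To load from more than one region, a workflow along these lines could be used (a sketch; the bucket names and regions are illustrative, and it assumes AWSRegion can be set at the session level in your environment; otherwise use ALTER DATABASE):
=> ALTER SESSION SET AWSRegion='us-east-1';
=> COPY t FROM 's3://AWS_Data_1/sales.parquet' PARQUET;
=> ALTER SESSION SET AWSRegion='eu-west-1';
=> COPY t FROM 's3://AWS_Data_EU/sales.parquet' PARQUET;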
The following example creates a user storage location and a role, so that users without their own S3 credentials can read data from S3 using the server credential.
--- set database-level credential (once):
=> ALTER DATABASE DEFAULT SET AWSAuth = 'myaccesskeyid123456:mysecretaccesskey123456789012345678901234';
=> CREATE LOCATION 's3://datalake' SHARED USAGE 'USER' LABEL 's3user';
=> CREATE ROLE ExtUsers;
--- Assign users to this role using GRANT (Role).
=> GRANT READ ON LOCATION 's3://datalake' TO ExtUsers;
The configuration properties for a given bucket may differ based on its type. The following S3BucketConfig setting is for an AWS bucket (AWSBucket) and a Pure Storage bucket (PureStorageBucket). AWSBucket doesn't specify an endpoint, so Vertica uses the AWSEndpoint configuration parameter, which defaults to s3.amazonaws.com:
ALTER DATABASE DEFAULT SET S3BucketConfig=
'[
{
"bucket": "AWSBucket",
"region": "us-east-2",
"protocol": "https",
"requesterPays": true
},
{
"bucket": "PureStorageBucket",
"endpoint": "pure.mycorp.net:1234",
"protocol": "http",
"enableVirtualAddressing": false
}
]';
The following example sets S3BucketCredentials for these two buckets:
ALTER SESSION SET S3BucketCredentials='
[
{
"bucket": "AWSBucket",
"accessKey": "<AK0>",
"secretAccessKey": "<SAK0>",
"sessionToken": "1234567890"
},
{
"bucket": "PureStorageBucket",
"accessKey": "<AK1>",
"secretAccessKey": "<SAK1>"
}
]';
The following example sets an STS temporary session token. Vertica uses the session token to access S3 with the specified credentials and bypasses checking for a USER storage location.
$ aws sts get-session-token
{
"Credentials": {
"AccessKeyId": "ASIAJZQNDVS727EHDHOQ",
"SecretAccessKey": "F+xnpkHbst6UPorlLGj/ilJhO5J2n3Yo7Mp4vYvd",
"SessionToken": "FQoDYXdzEKv//////////wEaDMWKxakEkCyuDH0UjyKsAe6/3REgW5VbWtpuYyVvSnEK1jzGPHi/jPOPNT7Kd+ftSnD3qdaQ7j28SUW9YYbD50lcXikz/HPlusPuX9sAJJb7w5oiwdg+ZasIS/+ejFgCzLeNE3kDAzLxKKsunvwuo7EhTTyqmlLkLtIWu9zFykzrR+3Tl76X7EUMOaoL31HOYsVEL5d9I9KInF0gE12ZB1yN16MsQVxpSCavOFHQsj/05zbxOQ4o0erY1gU=",
"Expiration": "2018-07-18T05:56:33Z"
}
}
$ vsql
=> ALTER SESSION SET AWSAuth = 'ASIAJZQNDVS727EHDHOQ:F+xnpkHbst6UPorlLGj/ilJhO5J2n3Yo7Mp4vYvd';
=> ALTER SESSION SET AWSSessionToken = 'FQoDYXdzEKv//////////wEaDMWKxakEkCyuDH0UjyKsAe6/3REgW5VbWtpuYyVvSnEK1jzGPHi/jPOPNT7Kd+ftSnD3qdaQ7j28SUW9YYbD50lcXikz/HPlusPuX9sAJJb7w5oiwdg+ZasIS/+ejFgCzLeNE3kDAzLxKKsunvwuo7EhTTyqmlLkLtIWu9zFykzrR+3Tl76X7EUMOaoL31HOYsVEL5d9I9KInF0gE12ZB1yN16MsQVxpSCavOFHQsj/05zbxOQ4o0erY1gU=';
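With the temporary credentials set for the session, a load can then proceed as usual; for example (a sketch reusing the illustrative datalake bucket from the earlier examples):
=> COPY t FROM 's3://datalake/sales.parquet' PARQUET;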
See also
Per-Bucket S3 Configurations
4.1 - Per-bucket S3 configurations
You can manage configurations and credentials for individual buckets with the configuration parameters S3BucketConfig and S3BucketCredentials. These parameters take a JSON object, whose respective properties behave like AWSAuth and AWSEndpoint.
For example, you can create a different configuration for each of your S3 buckets by setting S3BucketConfig at the database level with ALTER DATABASE. The following S3BucketConfig specifies all possible properties:
=> ALTER DATABASE DEFAULT SET S3BucketConfig='
[
{
"bucket": "exampleAWS",
"region": "us-east-2",
"protocol": "https",
"requesterPays": true
},
{
"bucket": "examplePureStorage",
"endpoint": "pure.mycorp.net:1234",
"protocol": "http",
"enableVirtualAddressing": false
}
]';
Users can then access a bucket by setting S3BucketCredentials at the session level with ALTER SESSION. The following S3BucketCredentials specifies all possible properties and authenticates to both exampleAWS and examplePureStorage simultaneously:
=> ALTER SESSION SET S3BucketCredentials='
[
{
"bucket": "exampleAWS",
"accessKey": "<AK0>",
"secretAccessKey": "<SAK0>",
"sessionToken": "1234567890"
},
{
"bucket": "examplePureStorage",
"accessKey": "<AK1>",
"secretAccessKey": "<SAK1>",
}
]';
Recommended usage
The recommended usage is as follows:
- Define in your S3 storage system one set of credentials per principal, per storage system.
- It is often most convenient to set S3BucketConfig once at the database level and have users authenticate by setting S3BucketCredentials at the session level.
- To access buckets outside those configured at the database level, set both S3BucketConfig and S3BucketCredentials at the session level.
If you cannot define credentials for your S3 storage, you can set S3BucketCredentials or AWSAuth at the database level with ALTER DATABASE, but this comes with certain drawbacks:
- Storing credentials statically in another location (in this case, in the Vertica catalog) always incurs additional risk.
- This increases overhead for the dbadmin, who needs to create user storage locations and grant access to each user or role.
- Users share one set of credentials, increasing the potential impact if the credentials are compromised.
Note
If you set AWSEndpoint to a non-Amazon S3 storage system like Pure Storage or MinIO and you want to configure S3BucketConfig for real Amazon S3, the following requirements apply:
- If your real Amazon S3 region is not us-east-1 (the default), you must specify the region.
- Set endpoint to an empty string ("").
ALTER DATABASE DEFAULT SET S3BucketConfig='
[
{
"bucket": "additionalAWSBucket",
"region": "us-east-2",
"endpoint": ""
}
]';
Precedence of per-bucket and standard parameters
Vertica uses the following rules to determine the effective set of properties for an S3 connection:
- If set, S3BucketCredentials takes priority over its standard parameters. S3BucketCredentials is checked first at the session level and then at the database level.
- The level/source of the S3 credential parameters determines the source of the S3 configuration parameters:
  - If credentials come from the session level, then the configuration can come from either the session or database level (with the session level taking priority).
  - If your credentials come from the database level, then the configuration can only come from the database level.
- If S3BucketConfig is set, it takes priority over its standard parameters. If an S3BucketConfig property isn't specified, Vertica falls back to the missing property's equivalent parameter. For example, if S3BucketConfig specifies every property except protocol, Vertica falls back to the standard parameter AWSEnableHttps.
Examples
Using per-bucket parameters
This example configures a real Amazon S3 bucket AWSBucket and a Pure Storage bucket PureStorageBucket with S3BucketConfig.
AWSBucket does not specify an endpoint or protocol, so Vertica falls back to AWSEndpoint (which defaults to s3.amazonaws.com) and AWSEnableHttps (which defaults to 1).
In this example environment, access to the PureStorageBucket is over a secure network, so HTTPS is disabled:
ALTER DATABASE DEFAULT SET S3BucketConfig='
[
{
"bucket": "AWSBucket",
"region": "us-east-2"
},
{
"bucket": "PureStorageBucket",
"endpoint": "pure.mycorp.net:1234",
"protocol": "http",
"enableVirtualAddressing": false
}
]';
Bob can then set S3BucketCredentials at the session level to authenticate to AWSBucket:
=> ALTER SESSION SET S3BucketCredentials='
[
{
"bucket": "AWSBucket",
"accessKey": "<AK0>",
"secretAccessKey": "<SAK0>",
"sessionToken": "1234567890"
}
]';
Similarly, Alice can authenticate to PureStorageBucket:
=> ALTER SESSION SET S3BucketCredentials='
[
{
"bucket": "PureStorageBucket",
"accessKey": "<AK1>",
"secretAccessKey": "<SAK1>"
}
]';
Charlie provides credentials for both AWSBucket and PureStorageBucket and authenticates to them simultaneously. This allows him to perform cross-endpoint joins, export from one bucket to another, and so on:
=> ALTER SESSION SET S3BucketCredentials='
[
{
"bucket": "AWSBucket",
"accessKey": "<AK0>",
"secretAccessKey": "<SAK0>",
"sessionToken": "1234567890"
},
{
"bucket": "PureStorageBucket",
"accessKey": "<AK1>",
"secretAccessKey": "<SAK1>"
}
]';
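As one sketch of such a cross-endpoint operation (the names are hypothetical: sales_external is assumed to be an external table defined over data in AWSBucket, and the target path is illustrative), Charlie might export data from one bucket to the other:
=> EXPORT TO PARQUET(directory = 's3://PureStorageBucket/backup/sales')
   AS SELECT * FROM sales_external;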
Non-Amazon S3 storage with AWSEndpoint and S3BucketConfig
If AWSEndpoint is set to a non-Amazon S3 bucket like Pure Storage or MinIO and you want to configure S3BucketConfig for a real Amazon S3 bucket, the following requirements apply:
- If your real Amazon S3 region is not us-east-1 (the default), you must specify the region.
- Set endpoint to an empty string ("").
In this example, AWSEndpoint is set to a Pure Storage endpoint.
=> ALTER DATABASE DEFAULT SET AWSEndpoint='pure.mycorp.net:1234';
To configure S3BucketConfig for a real Amazon S3 bucket realAmazonS3Bucket in region us-east-2:
=> ALTER DATABASE DEFAULT SET S3BucketConfig='
[
{
"bucket": "realAmazonS3Bucket",
"region": "us-east-2",
"endpoint": ""
}
]';