This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

S3 object store

File systems using the S3 protocol, including AWS, Pure Storage, and MinIO.

File systems using the S3 protocol, including AWS, Pure Storage, and MinIO.

URI format

s3://bucket/path

For AWS, specify the region using the AWSRegion configuration parameter, not the URI. If the region is incorrect, you might experience a delay before the load fails because Vertica retries several times before giving up. The default region is us-east-1.

Authentication

For AWS:

  • To access S3 you must create an IAM role and grant that role permission to access your S3 resources.

  • By default, bucket access is restricted to the communal storage bucket. Use an AWS access key to load data from non-communal storage buckets.

  • Either set the AWSAuth configuration parameter to provide credentials or create a USER storage location for the S3 path (see CREATE LOCATION) and grant users access.

  • You can use AWS STS temporary session tokens to load data. Because they are session tokens, do not use them for access to storage locations.

  • You can configure S3 buckets individually with the per-bucket parameters S3BucketConfig and S3BucketCredentials. For details, see Per-bucket S3 configurations.

Configuration parameters

The following database configuration parameters apply to the S3 file system. You can set parameters at different levels with the appropriate ALTER statement, such as ALTER SESSION...SET PARAMETER. Query the CONFIGURATION_PARAMETERS system table to determine what levels (node, session, user, database) are valid for a given parameter.

You can configure individual buckets using the S3BucketConfig and S3BucketCredentials parameters instead of the global parameters.

For external tables using highly partitioned data in an object store, see the ObjectStoreGlobStrategy configuration parameter and Partitions on Object Stores.

The following descriptions are summaries. For details about all parameters specific to S3, see S3 parameters.

AWSAuth
An ID and secret key for authentication. AWS calls these AccessKeyID and SecretAccessKey. For extra security, do not store credentials in the database; use ALTER SESSION...SET PARAMETER to set this value for the current session only.
AWSCAFile
The file name of the TLS server certificate bundle to use. You must set a value when installing a CA certificate on a SUSE Linux Enterprise Server.
AWSCAPath
The path Vertica uses to look up TLS server certificates. You must set a value when installing a CA certificate on a SUSE Linux Enterprise Server.
AWSEnableHttps
Boolean, whether to use the HTTPS protocol when connecting to S3. Can be set only at the database level. You can set the prototol for individual buckets using S3BucketConfig.

Default: 1 (enabled)

AWSEndpoint
String, the endpoint host for all S3 URLs, set as follows:
  • AWS: hostname_or_IP:port. Do not include the scheme (http(s)).

  • AWS with a FIPS-compliant S3 Endpoint: Hostname of a FIPS-compliant S3 endpoint. You must also enable S3EnableVirtualAddressing.

  • On-premises/Pure: IP address of the Pure Storage server.

If not set, Vertica uses virtual-hosted request URLs.

Default: 's3.amazonaws.com'

AWSLogLevel
The log level, one of: OFF, FATAL, ERROR, WARN, INFO, DEBUG, or TRACE.

Default: ERROR

AWSRegion
The AWS region containing the S3 bucket from which to read files. This parameter can only be configured with one region at a time. Failure to set the correct region can lead to a delay before queries fail.

Default: 'us-east-1'

AWSSessionToken
A temporary security token generated by running the get-session-token command, used to configure multi-factor authentication.
AWSStreamingConnectionPercentage
In Eon Mode, the number of connections to the communal storage to use for streaming reads. In a cloud environment, this setting helps prevent streaming data from using up all available file handles. This setting is unnecessary when using on-premises object stores because of their lower latency.
S3BucketConfig
A JSON array of objects specifying per-bucket configuration overrides. Each property other than the bucket name has a corresponding configuration parameter (shown in parentheses). If both the database-level parameter and its equivalent in S3BucketConfig are set, the value in S3BucketConfig takes precedence.

Properties:

  • bucket: Bucket name

  • region (AWSRegion)

  • protocol (AWSEnableHttp): Connection protocol, one of http or https

  • endpoint (AWSEndpoint)

  • enableVirtualAddressing (S3BucketCredentials): Boolean, whether to rewrite the S3 URL to use a virtual hosted path

  • requesterPays (S3RequesterPays)

  • serverSideEncryption (S3ServerSideEncryption)

  • sseCustomerAlgorithm (S3SseCustomerAlgorithm)

  • sseCustomerKey (S3SseCustomerKey)

  • sseKmsKeyId (S3SseKmsKeyId)

  • proxy (S3Proxy)

S3BucketCredentials
A JSON object specifying per-bucket credentials. Each property other than the bucket name has a corresponding configuration parameter. If both the database-level parameter and its equivalent in S3BucketCredentials are set, the value in S3BucketCredentials takes precedence.

Properties:

  • bucket: Bucket name

  • accessKey: Access key for the bucket (the ID in AWSAuth)

  • secretAccessKey: Secret access key for the bucket (the secret in AWSAuth)

  • sessionToken: Session token, only used when S3BucketCredentials is set at the session level (AWSSessionToken)

This parameter is only visible to superusers. Users can set this parameter at the session level with ALTER SESSION.

S3EnableVirtualAddressing
Boolean, whether to rewrite S3 URLs to use virtual-hosted paths (disabled by default). This configuration setting takes effect only when you have specified a value for AWSEndpoint.

If you set AWSEndpoint to a FIPS-compliant S3 endpoint, you must enable S3EnableVirtualAddressing.

The value of this parameter does not affect how you specify S3 paths.

S3Proxy
HTTP(S) proxy settings, if needed, a string in the following format: http[s]://[user:password@]host[:port].
S3RequesterPays
Boolean, whether requester (instead of bucket owner) pays the cost of accessing data on the bucket.
S3ServerSideEncryption
String, encryption algorithm to use when reading or writing to S3. Supported values are AES256 (for SSE-S3), aws:kms (for SSE-KMS), and an empty string (for no encryption). See Server-Side Encryption.

Default: "" (no encryption)

S3SseCustomerAlgorithm
String, the encryption algorithm to use when reading or writing to S3 using SSE-C encryption. The only supported values are AES256 and "". For SSE-S3 and SSE-KMS, instead use S3ServerSideEncryption.

Default: "" (no encryption)

S3SseCustomerKey
If using SSE-C encryption, the client key for S3 access.
S3SseKmsKeyId
If using SSE-KMS encryption, the key identifier (not the key) to pass to the Key Management Service. Vertica must have permission to use the key, which is managed through KMS.

Write performance

By default, Vertica performs writes using a single thread, but a single write usually includes multiple files or parts of files. For writes to S3, you can use a larger thread pool to perform writes in parallel. This thread pool is used for all file writes to S3, including file exports and writes to communal storage.

The size of the thread pool is controlled by the ObjStoreUploadParallelism configuration parameter. Each node has a single thread pool used for all file writes. In general, one or two threads per concurrent writer produces good results.

Server-side encryption

By default, Vertica reads and writes S3 data that is not encrypted. If the S3 bucket uses server-side encryption (SSE), you can configure Vertica to access it. S3 supports three types of server-side encryption: SSE-S3, SSE-KMS, and SSE-C.

Vertica must also have read or write permissions (depending on the operation) on the bucket.

SSE-S3

With SSE-S3, the S3 service manages encryption keys. Reads do not require additional configuration. To write to S3, the client (Vertica, in this case) must specify only the encryption algorithm.

If the S3 bucket is configured with the default encryption settings, Vertica can read and write data to them with no further changes. If the bucket does not use the default encryption settings, set the S3ServerSideEncryption configuration parameter or the serverSideEncryption field in S3BucketConfig to AES256.

SSE-KMS

With SSE-KMS, encryption keys are managed by the Key Management Service (KMS). The client must supply a KMS key identifier (not the actual key) when writing data. For all operations, the client must have permission to use the KMS key. These permissions are managed in KMS, not in Vertica.

To use SSE-KMS:

  • Set the S3ServerSideEncryption configuration parameter or the serverSideEncryption field in S3BucketConfig to aws:kms.

  • Set the S3SseKmsKeyId configuration parameter or the sseKmsKeyId field in S3BucketConfig to the key ID.

SSE-C

With SSE-C, the client manages encryption keys and provides them to S3 for each operation.

To use SSE-C:

  • Set the S3SseCustomerAlgorithm configuration parameter or the sseCustomerAlgorithm field in S3BucketConfig to AES256.

  • Set the S3SseCustomerKey configuration parameter or the sseCustomerKey field in S3BucketConfig to the access key. The value can be either a 32-character plaintext key or a 44-character base64-encoded key.

HTTP(S) logging (AWS)

When a request to AWS fails or is retried, or Vertica detects a performance degradation, Vertica logs the event in the UDFS_EVENTS system table. The event type for these records is HttpRequestAttempt, and the description is a JSON object with more details. For example:

=> SELECT filesystem, event, description FROM UDFS_EVENTS;
 filesystem |        event        |                                                                 description
------------+---------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------
 S3         |  HttpRequestAttempt | {"AttemptStartTimeGMT":"2024-01-18 16:45:52.663","RetryCount":0,"HttpMethod":"GET","URIString":"s3://mybucket/s3_2.dat","AmzRequestId":"79104EXAMPLEB723","AmzId2":"IOWQ4fDEXAMPLEQM+ey7N9WgVhSnQ6JEXAMPLEZb7hSQDASK+Jd1vEXAMPLEa3Km","HttpStatusCode":404,"HttpRequestLatency":2,"SdkExceptionMessage":"No response body."}
 S3         |  HttpRequestAttempt | {"AttemptStartTimeGMT":"2024-01-18 16:46:02.663","RetryCount":1,"HttpMethod":"GET","URIString":"s3://mybucket/s3_2.dat","AmzRequestId":"79104EXAMPLEB791","AmzId2":"JPXQ4fDEXAMPLEQM+ey7N9WgVhSnQ6JEXAMPLEZb7hSQDASK+Jd1vEXAMPLEa4Ln","HttpStatusCode":206,"HttpRequestLatency":1}
(2 rows)

To extract fields from this description, use the MAPJSONEXTRACTOR function:

=> SELECT filesystem, event, 
     (MapJSONExtractor(description))['RetryCount'] as RetryCount, 
     (MapJSONExtractor(description))['HttpStatusCode'] as HttpStatusCode
   FROM UDFS_EVENTS;
 filesystem |        event        | RetryCount | HttpStatusCode
------------+---------------------+------------+-----------------
 S3         |  HttpRequestAttempt |          0 |            404
 S3         |  HttpRequestAttempt |          1 |            206
(2 rows)

For details on AWS-specific fields such as AmzRequestId, see the AWS documentation.

The UDFS_STATISTICS system table records the accumulated duration for all HTTP requests in the TOTAL_REQUEST_DURATION_MS column. This table also provides computed values for average duration and throughput.

Examples

The following example sets a database-wide AWS region and credentials:

=> ALTER DATABASE DEFAULT SET AWSRegion='us-west-1';
=> ALTER DATABASE DEFAULT SET AWSAuth = 'myaccesskeyid123456:mysecretaccesskey123456789012345678901234';

The following example loads data from S3. You can use a glob if all files in the glob can be loaded together. In the following example, AWS_DataLake contains only ORC files.

=> COPY t FROM 's3://datalake/*' ORC;

You can specify a list of comma-separated S3 buckets as in the following example. All buckets must be in the same region. To load from more than one region, use separate COPY statements and change the value of AWSRegion between calls.

=> COPY t FROM 's3://AWS_Data_1/sales.parquet', 's3://AWS_Data_2/sales.parquet' PARQUET;

The following example creates a user storage location and a role, so that users without their own S3 credentials can read data from S3 using the server credential.

   --- set database-level credential (once):
=> ALTER DATABASE DEFAULT SET AWSAuth = 'myaccesskeyid123456:mysecretaccesskey123456789012345678901234';

=> CREATE LOCATION 's3://datalake' SHARED USAGE 'USER' LABEL 's3user';

=> CREATE ROLE ExtUsers;
   --- Assign users to this role using GRANT (Role).

=> GRANT READ ON LOCATION 's3://datalake' TO ExtUsers;

The configuration properties for a given bucket may differ based on its type. The following S3BucketConfig setting is for an AWS bucket (AWSBucket) and a Pure Storage bucket (PureStorageBucket). AWSBucket doesn't specify an endpoint, so Vertica uses the AWSEndpoint configuration parameter, which defaults to s3.amazonaws.com:

=> ALTER DATABASE DEFAULT SET S3BucketConfig=
'[
    {
        "bucket": "AWSBucket",
        "region": "us-east-2",
        "protocol": "https",
        "requesterPays": true,
        "serverSideEncryption": "aes256"
    },
    {
        "bucket": "PureStorageBucket",
        "endpoint": "pure.mycorp.net:1234",
        "protocol": "http",
        "enableVirtualAddressing": false
    }
]';

The following example sets S3BucketCredentials for these two buckets:

=> ALTER SESSION SET S3BucketCredentials='
[
    {
        "bucket": "AWSBucket",
        "accessKey": "<AK0>",
        "secretAccessKey": "<SAK0>",
        "sessionToken": "1234567890"
    },
    {
        "bucket": "PureStorageBucket",
        "accessKey": "<AK1>",
        "secretAccessKey": "<SAK1>"
    }
]';

The following example sets an STS temporary session token. Vertica uses the session token to access S3 with the specified credentials and bypasses checking for a USER storage location.

$ aws sts get-session-token
  {
    "Credentials": {
        "AccessKeyId": "ASIAJZQNDVS727EHDHOQ",
        "SecretAccessKey": "F+xnpkHbst6UPorlLGj/ilJhO5J2n3Yo7Mp4vYvd",
        "SessionToken": "FQoDYXdzEKv//////////wEaDMWKxakEkCyuDH0UjyKsAe6/3REgW5VbWtpuYyVvSnEK1jzGPHi/jPOPNT7Kd+ftSnD3qdaQ7j28SUW9YYbD50lcXikz/HPlusPuX9sAJJb7w5oiwdg+ZasIS/+ejFgCzLeNE3kDAzLxKKsunvwuo7EhTTyqmlLkLtIWu9zFykzrR+3Tl76X7EUMOaoL31HOYsVEL5d9I9KInF0gE12ZB1yN16MsQVxpSCavOFHQsj/05zbxOQ4o0erY1gU=",
        "Expiration": "2018-07-18T05:56:33Z"
    }
  }

$ vsql
=> ALTER SESSION SET AWSAuth = 'ASIAJZQNDVS727EHDHOQ:F+xnpkHbst6UPorlLGj/ilJhO5J2n3Yo7Mp4vYvd';
=> ALTER SESSION SET AWSSessionToken = 'FQoDYXdzEKv//////////wEaDMWKxakEkCyuDH0UjyKsAe6/3REgW5VbWtpuYyVvSnEK1jzGPHi/jPOPNT7Kd+ftSnD3qdaQ7j28SUW9YYbD50lcXikz/HPlusPuX9sAJJb7w5oiwdg+ZasIS/+ejFgCzLeNE3kDAzLxKKsunvwuo7EhTTyqmlLkLtIWu9zFykzrR+3Tl76X7EUMOaoL31HOYsVEL5d9I9KInF0gE12ZB1yN16MsQVxpSCavOFHQsj/05zbxOQ4o0erY1gU=';

See also

Per-Bucket S3 Configurations

1 - Per-bucket S3 configurations

You can manage configurations and credentials for individual buckets with the S3BucketConfig and S3BucketCredentials configuration parameters.

You can manage configurations and credentials for individual buckets with the S3BucketConfig and S3BucketCredentials configuration parameters. These parameters each take a JSON object, whose respective properties behave like the related S3 configuration parameters.

For example, you can create a different configuration for each of your S3 buckets by setting S3BucketConfig at the database level with ALTER DATABASE. The following S3BucketConfig specifies several common bucket properties:

=> ALTER DATABASE DEFAULT SET S3BucketConfig='
[
    {
        "bucket": "exampleAWS",
        "region": "us-east-2",
        "protocol": "https",
        "requesterPays": true
    },
    {
        "bucket": "examplePureStorage",
        "endpoint": "pure.mycorp.net:1234",
        "protocol": "http",
        "enableVirtualAddressing": false
    }
]';

Users can then access a bucket by setting S3BucketCredentials at the session level with ALTER SESSION. The following S3BucketCredentials specifies all properties and authenticates to both exampleAWS and examplePureStorage simultaneously:

=> ALTER SESSION SET S3BucketCredentials='
[
    {
        "bucket": "exampleAWS",
        "accessKey": "<AK0>",
        "secretAccessKey": "<SAK0>",
        "sessionToken": "1234567890"
    },
    {
        "bucket": "examplePureStorage",
        "accessKey": "<AK1>",
        "secretAccessKey": "<SAK1>",
    }
]';

The recommended usage is as follows:

  • Define in your S3 storage system one set of credentials per principal, per storage system.

  • It is often most convenient to set S3BucketConfig once at the database level and have users authenticate by setting S3BucketCredentials at the session level.

  • To access buckets outside those configured at the database level, set both S3BucketConfig and S3BucketCredentials at the session level.

If you cannot define credentials for your S3 storage, you can set S3BucketCredentials or AWSAuth at the database level with ALTER DATABASE, but this comes with certain drawbacks:

  • Storing credentials statically in another location (in this case, in the Vertica catalog) always incurs additional risk.

  • This increases overhead for the dbadmin, who needs to create user storage locations and grant access to each user or role.

  • Users share one set of credentials, increasing the potential impact if the credentials are compromised.

Precedence of per-bucket and standard parameters

Vertica uses the following rules to determine the effective set of properties for an S3 connection:

  • If set, S3BucketCredentials takes priority over its standard parameters. S3BucketCredentials is checked first at the session level and then at the database level.

  • The level/source of the S3 credential parameters determines the source of the S3 configuration parameters:

    • If credentials come from the session level, then the configuration can come from either the session or database level (with the session level taking priority).

    • If your credentials come from the database level, then the configuration can only come from the database level.

  • If S3BucketConfig is set, it takes priority over its standard parameters. If an S3BucketConfig property isn't specified, Vertica falls back to the missing property's equivalent parameter. For example, if S3BucketConfig specifies every property except protocol, Vertica falls back to the standard parameter AWSEnableHttps.

Examples

Multiple buckets

This example configures a real Amazon S3 bucket AWSBucket and a Pure Storage bucket PureStorageBucket with S3BucketConfig.

AWSBucket does not specify an endpoint or protocol, so Vertica falls back to AWSEndpoint (defaults to s3.amazonaws.com) and AWSEnableHttps (defaults to 1).

In this example environment, access to the PureStorageBucket is over a secure network, so HTTPS is disabled:

=> ALTER DATABASE DEFAULT SET S3BucketConfig='
[
    {
        "bucket": "AWSBucket",
        "region": "us-east-2"
    },
    {
        "bucket": "PureStorageBucket",
        "endpoint": "pure.mycorp.net:1234",
        "protocol": "http",
        "enableVirtualAddressing": false
    }
]';

Bob can then set S3BucketCredentials at the session level to authenticate to AWSBucket:

=> ALTER SESSION SET S3BucketCredentials='
[
    {
        "bucket": "AWSBucket",
        "accessKey": "<AK0>",
        "secretAccessKey": "<SAK0>",
        "sessionToken": "1234567890"
    }
]';

Similarly, Alice can authenticate to PureStorageBucket:

=> ALTER SESSION SET S3BucketCredentials='
[
    {
        "bucket": "PureStorageBucket",
        "accessKey": "<AK1>",
        "secretAccessKey": "<SAK1>"
    }
]';

Charlie provides credentials for both AWSBucket and PureStorageBucket and authenticates to them simultaneously. This allows him to perform cross-endpoint joins, export from one bucket to another, etc.

=> ALTER SESSION SET S3BucketCredentials='
[
    {
        "bucket": "AWSBucket",
        "accessKey": "<AK0>",
        "secretAccessKey": "<SAK0>",
        "sessionToken": "1234567890"
    },
    {
        "bucket": "PureStorageBucket",
        "accessKey": "<AK1>",
        "secretAccessKey": "<SAK1>"
    }
]';

S3 server-side encryption

S3 has three types of server-side encryption: SSE-S3, SSE-KMS, and SSE-C. The following example configures access using SSE-KMS:

=> ALTER DATABASE DEFAULT SET S3BucketConfig='
[
    {
        "bucket": "AWSBucket",
        "region": "us-east-2",
        "serverSideEncryption": "aws:kms",
        "sseKmsKeyId": "1234abcd-12ab-34cd-56ef-1234567890ab"
    }
]';

For more information, see Server-Side Encryption.

Non-amazon S3 storage with AWSEndpoint and S3BucketConfig

If AWSEndpoint is set to a non-Amazon S3 bucket like Pure Storage or MinIO and you want to configure S3BucketConfig for a real Amazon S3 bucket, the following requirements apply:

  • If your real Amazon S3 region is not us-east-1 (the default), you must specify the region.

  • Set endpoint to an empty string ("").

In this example, AWSEndpoint is set to a Pure Storage bucket.

=> ALTER DATABASE DEFAULT SET AWSEndpoint='pure.mycorp.net:1234';

To configure S3BucketConfig for a real Amazon S3 bucket realAmazonS3Bucket in region "us-east-2":

=> ALTER DATABASE DEFAULT SET S3BucketConfig='
[
    {
        "bucket": "realAmazonS3Bucket",
        "region": "us-east-2",
        "endpoint": ""
    },
]';