<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>OpenText Analytics Database 26.2.x – Storage locations on HDFS</title>
    <link>/en/admin/managing-storage-locations/storage-locations-on-hdfs/</link>
    <description>Recent content in Storage locations on HDFS on OpenText Analytics Database 26.2.x</description>
    <generator>Hugo -- gohugo.io</generator>
    
	  <atom:link href="/en/admin/managing-storage-locations/storage-locations-on-hdfs/index.xml" rel="self" type="application/rss+xml" />
    
    
      
        
      
    
    
    <item>
      <title>Admin: Requirements for HDFS storage locations</title>
      <link>/en/admin/managing-storage-locations/storage-locations-on-hdfs/requirements-hdfs-storage-locations/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/managing-storage-locations/storage-locations-on-hdfs/requirements-hdfs-storage-locations/</guid>
      <description>
        
        
        
&lt;div class=&#34;admonition caution&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Caution&lt;/h4&gt;

If you use HDFS storage locations, the HDFS data must be available when you start the database. Your HDFS cluster must be operational, and the ROS files must be present. If the data files have been moved or corrupted, or your HDFS cluster is not responsive, the database cannot start.

&lt;/div&gt;
&lt;p&gt;To store data on HDFS, verify that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Your Hadoop cluster has WebHDFS enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All of the nodes in your database cluster can connect to all of the nodes in your Hadoop cluster. Any firewall between the two clusters must allow connections on the ports used by HDFS.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If your HDFS cluster is unsecured, you have a Hadoop user whose username matches the name of the &lt;a class=&#34;glosslink&#34; href=&#34;../../../../en/glossary/db-superuser/&#34; title=&#34;&#34;&gt;database superuser&lt;/a&gt; (usually named dbadmin). This Hadoop user must have read and write access to the HDFS directory where you want the database to store its data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If your HDFS cluster uses Kerberos authentication:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;You have a Kerberos principal for the database, and it has read and write access to the HDFS directory that will be used for the storage location. See &lt;a href=&#34;#Kerberos&#34;&gt;Kerberos&lt;/a&gt; below for instructions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Kerberos KDC is running.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Your HDFS cluster has enough storage available for database data. See &lt;a href=&#34;#Space&#34;&gt;Space Requirements&lt;/a&gt; below for details.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The data you store in an HDFS-backed storage location does not expand your database&#39;s size beyond any data allowance in your OpenText™ Analytics Database license. Data stored in an HDFS-backed storage location is counted as part of any data allowance set by your license. See &lt;a href=&#34;../../../../en/admin/managing-licenses/#&#34;&gt;Managing licenses&lt;/a&gt; in the Administrator&#39;s Guide for more information. A sketch of how to check your current license usage follows this list.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
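&lt;p&gt;To check how much of your license&#39;s data allowance is already in use before you move data into an HDFS-backed storage location, you can run the license-auditing meta-functions. The following is a minimal sketch; the exact output depends on your license:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Report current license utilization, including any raw data allowance
=&amp;gt; SELECT GET_COMPLIANCE_STATUS();

-- Estimate the raw size of all audited data in the database
=&amp;gt; SELECT AUDIT(&amp;#39;&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;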
&lt;p&gt;Backup/Restore has &lt;a href=&#34;../../../../en/admin/backup-and-restore/requirements-backing-up-and-restoring-hdfs-storage-locations/&#34;&gt;additional requirements&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a name=&#34;Space&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;space-requirements&#34;&gt;Space requirements&lt;/h2&gt;
&lt;p&gt;If your database is &lt;a class=&#34;glosslink&#34; href=&#34;../../../../en/glossary/k-safety/&#34; title=&#34;For more information, see Designing for K-Safety.&#34;&gt;K-safe&lt;/a&gt;, HDFS-based storage locations contain two copies of the data you store in them. One copy is the primary projection, and the other is the buddy projection. If you have enabled HDFS&#39;s data-redundancy feature, Hadoop stores both projections multiple times. This duplication might seem excessive. However, it is similar to how a RAID array of level 1 or higher stores redundant copies of both the primary and buddy projections. The redundant copies also improve HDFS performance by enabling multiple nodes to process a request for a file.&lt;/p&gt;
&lt;p&gt;Verify that your HDFS installation has sufficient space available for redundant storage of both the primary and buddy projections of your K-safe data. You can adjust the number of duplicates stored by HDFS by setting the &lt;code&gt;HadoopFSReplication&lt;/code&gt; configuration parameter. See &lt;a href=&#34;../../../../en/admin/managing-storage-locations/storage-locations-on-hdfs/troubleshooting-hdfs-storage-locations/&#34;&gt;Troubleshooting HDFS Storage Locations&lt;/a&gt; for details.&lt;/p&gt;
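&lt;p&gt;For example, the following sketch sets the HDFS replication factor for files the database writes to HDFS storage locations to two copies; the value 2 is only illustrative, and the troubleshooting topic shows the same parameter being set to 1:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Store two HDFS copies of each file the database writes to HDFS storage locations
=&amp;gt; ALTER DATABASE DEFAULT SET PARAMETER HadoopFSReplication = 2;
&lt;/code&gt;&lt;/pre&gt;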
&lt;p&gt;&lt;a name=&#34;Kerberos&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;kerberos&#34;&gt;Kerberos&lt;/h2&gt;
&lt;p&gt;To use a storage location in HDFS with Kerberos, take the following additional steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create a Kerberos principal for each database node as explained in &lt;a href=&#34;../../../../en/hadoop-integration/accessing-kerberized-hdfs-data/using-kerberos-with/#&#34;&gt;Using Kerberos with OpenText™ Analytics Database&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Give all node principals read and write permission to the HDFS directory you will use as a storage location.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you plan to use &lt;code&gt;vbr&lt;/code&gt; to back up and restore the location, see additional requirements in &lt;a href=&#34;../../../../en/admin/backup-and-restore/requirements-backing-up-and-restoring-hdfs-storage-locations/#&#34;&gt;Requirements for backing up and restoring HDFS storage locations&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;adding-hdfs-storage-locations-to-new-nodes&#34;&gt;Adding HDFS storage locations to new nodes&lt;/h2&gt;
&lt;p&gt;If you add nodes to your database cluster, they do not automatically have access to existing HDFS storage locations. You must manually create the storage location for the new node using the &lt;a href=&#34;../../../../en/sql-reference/statements/create-statements/create-location/#&#34;&gt;CREATE LOCATION&lt;/a&gt; statement. Do not use the &lt;span class=&#34;sql&#34;&gt;ALL NODES&lt;/span&gt; option in this statement. Instead, use the &lt;span class=&#34;sql&#34;&gt;NODE&lt;/span&gt; option with the name of the new node so that only that node adds the shared location.

&lt;div class=&#34;admonition caution&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Caution&lt;/h4&gt;

You must manually create the storage location. Otherwise, the new node uses the default storage policy (usually, storage on the local Linux file system) to store data that the other nodes store in HDFS. As a result, the node can run out of disk space.

&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Consider an HDFS storage location that was created on a three-node cluster with the following statements:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE LOCATION &amp;#39;hdfs://hadoopNS/vertica/colddata&amp;#39; ALL NODES SHARED
    USAGE &amp;#39;data&amp;#39; LABEL &amp;#39;coldstorage&amp;#39;;

=&amp;gt; SELECT SET_OBJECT_STORAGE_POLICY(&amp;#39;SchemaName&amp;#39;,&amp;#39;coldstorage&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The following example shows how to add the storage location to a new cluster node:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE LOCATION &amp;#39;hdfs://hadoopNS/vertica/colddata&amp;#39; NODE &amp;#39;v_vmart_node0004&amp;#39;
   SHARED USAGE &amp;#39;data&amp;#39; LABEL &amp;#39;coldstorage&amp;#39;;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Any &lt;a href=&#34;../../../../en/admin/managing-db/managing-nodes/active-standby-nodes/&#34;&gt;active standby nodes&lt;/a&gt; that are in your cluster when you create an HDFS storage location automatically create their own instances of the location. When a standby node takes over for a down node, it uses its own instance of the location to store data for objects that use the HDFS storage policy. Treat standby nodes added after you create the storage location like any other new node: you must manually define the HDFS storage location for them.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Admin: How the HDFS storage location stores data</title>
      <link>/en/admin/managing-storage-locations/storage-locations-on-hdfs/how-hdfs-storage-location-stores-data/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/managing-storage-locations/storage-locations-on-hdfs/how-hdfs-storage-location-stores-data/</guid>
      <description>
        
        
        &lt;p&gt;OpenText™ Analytics Database stores data in storage locations on HDFS similarly to the way it stores data in the Linux file system. When you create a storage location on HDFS, the database stores the &lt;a class=&#34;glosslink&#34; href=&#34;../../../../en/glossary/ros-read-optimized-store/&#34; title=&#34;Read Optimized Store (ROS) is a highly optimized, read-oriented, disk storage structure, organized by projection.&#34;&gt;ROS&lt;/a&gt; containers holding its data on HDFS. You can choose which data uses the HDFS storage location: from the data for just a single table or partition to all of the database&#39;s data.&lt;/p&gt;
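&lt;p&gt;For example, once a labeled HDFS storage location exists, you can direct a single table to it with a storage policy. The following is a sketch only, using the &lt;code&gt;SET_OBJECT_STORAGE_POLICY&lt;/code&gt; function shown in the other topics in this section; the table name &lt;code&gt;sales_fact&lt;/code&gt; and the label &lt;code&gt;coldstorage&lt;/code&gt; are placeholders:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Store only this table&amp;#39;s ROS containers in the HDFS location labeled &amp;#39;coldstorage&amp;#39;
=&amp;gt; SELECT SET_OBJECT_STORAGE_POLICY(&amp;#39;sales_fact&amp;#39;, &amp;#39;coldstorage&amp;#39;);
&lt;/code&gt;&lt;/pre&gt;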
&lt;p&gt;When the database reads data from or writes data to an HDFS storage location, the node storing or retrieving the data contacts the Hadoop cluster directly to transfer the data. If a single ROS container file is split among several HDFS nodes, the node connects to each of them. The node retrieves the pieces and reassembles the file. Because each node fetches its own data directly from the source, data transfers are parallel, increasing their efficiency. Having the nodes directly retrieve the file splits also reduces the impact on the Hadoop cluster.&lt;/p&gt;
&lt;h2 id=&#34;what-you-can-store-in-hdfs&#34;&gt;What you can store in HDFS&lt;/h2&gt;
&lt;p&gt;Use HDFS storage locations to store only data. You cannot store catalog information in an HDFS storage location.

&lt;div class=&#34;admonition caution&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Caution&lt;/h4&gt;

While it is possible to use an HDFS storage location for temporary data storage, you must never do so. Using HDFS for temporary storage causes severe performance issues.

&lt;/div&gt;&lt;/p&gt;
&lt;h2 id=&#34;what-hdfs-storage-locations-cannot-do&#34;&gt;What HDFS storage locations cannot do&lt;/h2&gt;
&lt;p&gt;Because OpenText™ Analytics Database uses storage locations to store ROS containers in a proprietary format, MapReduce and other Hadoop components cannot access your ROS data stored in HDFS. Never allow another program that has access to HDFS to write to the ROS files. Any outside modification of these files can lead to data corruption and loss. Applications must use the &lt;a href=&#34;../../../../en/connecting-to/client-libraries/&#34;&gt;client libraries&lt;/a&gt; to access data. If you want to share ROS data with other Hadoop components, you can export it (see &lt;a href=&#34;../../../../en/data-export/file-export/#&#34;&gt;File export&lt;/a&gt;).&lt;/p&gt;
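&lt;p&gt;For example, to make a table&#39;s data readable by other Hadoop tools, you can export it to Parquet files in an HDFS directory instead of sharing the ROS files. The following is a sketch only, assuming a table named &lt;code&gt;messages&lt;/code&gt; and write access to the target directory; see File export for the full syntax:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Write Parquet files that Hive, MapReduce, and other Hadoop tools can read
=&amp;gt; EXPORT TO PARQUET (directory = &amp;#39;hdfs://hadoopNS/user/dbadmin/messages_export&amp;#39;)
   AS SELECT * FROM messages;
&lt;/code&gt;&lt;/pre&gt;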

      </description>
    </item>
    
    <item>
      <title>Admin: Best practices for SQL on Apache Hadoop</title>
      <link>/en/admin/managing-storage-locations/storage-locations-on-hdfs/best-practices-sql-on-hadoop/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/managing-storage-locations/storage-locations-on-hdfs/best-practices-sql-on-hadoop/</guid>
      <description>
        
        
        &lt;p&gt;If you are using the OpenText™ Analytics Database for SQL on Apache Hadoop product, use the following best practices for storage locations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Place only storage locations with a usage type of DATA on HDFS storage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Place temp space directly on the local Linux file system, not in HDFS (see the sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For the best performance, place the database catalog directly on the local Linux file system.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create the database first on a local Linux file system. Then, you can extend the database to HDFS storage locations and set storage policies that exclusively place data blocks on the HDFS storage location.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For better performance, if you are running the database on only a subset of the HDFS nodes, do not run the HDFS balancer on those nodes. The HDFS balancer can move data blocks away from the database nodes, causing the database to read non-local data during query execution. Queries run faster when they do not require network I/O.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
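&lt;p&gt;The following sketch shows one way to add temporary space on the local Linux file system; the path is illustrative and must exist and be writable by the database on every node:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Add local temp space on every node in the cluster
=&amp;gt; CREATE LOCATION &amp;#39;/home/dbadmin/verticatemp&amp;#39; ALL NODES USAGE &amp;#39;TEMP&amp;#39;;
&lt;/code&gt;&lt;/pre&gt;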
&lt;p&gt;Generally, HDFS requires approximately 2 GB of memory for each node in the cluster. To support this requirement in your database configuration:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create a 2-GB resource pool.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Do not assign any database execution resources to this pool. This approach reserves the memory for use by HDFS (see the sketch after this list).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
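&lt;p&gt;A minimal sketch of such a pool follows. The pool name is illustrative; because no users or queries are assigned to the pool, the memory it claims stays reserved for HDFS:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;-- Reserve roughly 2 GB of memory per node for the HDFS processes
=&amp;gt; CREATE RESOURCE POOL hdfs_reserve MEMORYSIZE &amp;#39;2G&amp;#39;;
&lt;/code&gt;&lt;/pre&gt;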
&lt;p&gt;Alternatively, use Ambari or Cloudera Manager to find the maximum heap size required by HDFS and set the size of the resource pool to that value.&lt;/p&gt;
&lt;p&gt;For more about how to configure resource pools, see &lt;a href=&#34;../../../../en/admin/managing-db/managing-workloads/#&#34;&gt;Managing workloads&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Admin: Troubleshooting HDFS storage locations</title>
      <link>/en/admin/managing-storage-locations/storage-locations-on-hdfs/troubleshooting-hdfs-storage-locations/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>/en/admin/managing-storage-locations/storage-locations-on-hdfs/troubleshooting-hdfs-storage-locations/</guid>
      <description>
        
        
        &lt;p&gt;This topic explains some common issues with HDFS storage locations.&lt;/p&gt;
&lt;p&gt;&lt;a name=&#34;HDFS-Sto&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;hdfs-storage-disk-consumption&#34;&gt;HDFS storage disk consumption&lt;/h2&gt;
&lt;p&gt;By default, HDFS makes three copies of each file it stores. This replication helps prevent data loss due to disk or system failure. It also helps increase performance by allowing several nodes to handle a request for a file.&lt;/p&gt;
&lt;p&gt;A database with a &lt;a class=&#34;glosslink&#34; href=&#34;../../../../en/glossary/k-safety/&#34; title=&#34;For more information, see Designing for K-Safety.&#34;&gt;K-safety&lt;/a&gt; value of 1 or greater also stores its data redundantly using buddy projections.&lt;/p&gt;
&lt;p&gt;When a K-safe database stores data in an HDFS storage location, its data redundancy is compounded by HDFS&#39;s redundancy. By default, HDFS stores three copies of the primary projection&#39;s data, plus three copies of the buddy projection&#39;s data, for a total of six copies of the data.&lt;/p&gt;
&lt;p&gt;To reduce the amount of disk storage used by HDFS locations, set the &lt;code&gt;HadoopFSReplication&lt;/code&gt; configuration parameter, which controls the number of copies of data that HDFS stores.&lt;/p&gt;
&lt;p&gt;You can determine the current HDFS disk usage by logging into the Hadoop NameNode and issuing the command:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ hdfs dfsadmin -report
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This command prints the usage for the entire HDFS storage, followed by details for each node in the Hadoop cluster. The following example shows the beginning of the output from this command, with the amount of storage in use (the DFS Used value) highlighted:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ hdfs dfsadmin -report
Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32087212032 (29.88 GB)
DFS Remaining: 31565144064 (29.40 GB)
DFS Used: 522067968 (&lt;span class=&#34;code-input&#34;&gt;497.88 MB&lt;/span&gt;)
DFS Used%: 1.63%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;After loading a simple million-row data set into a table stored in an HDFS storage location, the report shows greater disk usage:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32085299338 (29.88 GB)
DFS Remaining: 31373565952 (29.22 GB)
DFS Used: 711733386 (&lt;span class=&#34;code-input&#34;&gt;678.76 MB&lt;/span&gt;)
DFS Used%: 2.22%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The following example demonstrates:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Creating the storage location on HDFS.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dropping the table in the database.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Setting the &lt;code&gt;HadoopFSReplication&lt;/code&gt; configuration parameter to 1. This tells HDFS to store a single copy of an HDFS storage location&#39;s data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Recreating the table and reloading its data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE LOCATION &amp;#39;hdfs://hadoopNS/user/dbadmin&amp;#39; ALL NODES SHARED
    USAGE &amp;#39;data&amp;#39; LABEL &amp;#39;hdfs&amp;#39;;
CREATE LOCATION

=&amp;gt; DROP TABLE messages;
DROP TABLE

=&amp;gt; ALTER DATABASE DEFAULT SET PARAMETER HadoopFSReplication = 1;

=&amp;gt; CREATE TABLE messages (id INTEGER, text VARCHAR);
CREATE TABLE

=&amp;gt; SELECT SET_OBJECT_STORAGE_POLICY(&amp;#39;messages&amp;#39;, &amp;#39;hdfs&amp;#39;);
 SET_OBJECT_STORAGE_POLICY
----------------------------
Object storage policy set.
(1 row)

=&amp;gt; COPY messages FROM &amp;#39;/home/dbadmin/messages.txt&amp;#39;;
 Rows Loaded
-------------
1000000
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Running the HDFS report on Hadoop now shows lower disk usage:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ hdfs dfsadmin -report
Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32086278190 (29.88 GB)
DFS Remaining: 31500988416 (29.34 GB)
DFS Used: 585289774 (&lt;span class=&#34;code-input&#34;&gt;558.18 MB&lt;/span&gt;)
DFS Used%: 1.82%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;admonition caution&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;admonition-head&#34;&gt;Caution&lt;/h4&gt;

Reducing the number of copies of data stored by HDFS increases the risk of data loss. It can also negatively impact the performance of HDFS by reducing the number of nodes that can provide access to a file. This slower performance can impact the performance of database queries that involve data stored in an HDFS storage location.

&lt;/div&gt;
&lt;h2 id=&#34;error-6966-storagebundlewriter&#34;&gt;ERROR 6966: StorageBundleWriter&lt;/h2&gt;
&lt;p&gt;You might encounter Error 6966 when loading data into a storage location on a small Hadoop cluster (5 or fewer data nodes). This error is caused by the way HDFS manages the write pipeline and replication. You can mitigate this problem by reducing the number of replicas as explained in &lt;a href=&#34;#HDFS-Sto&#34;&gt;HDFS Storage Disk Consumption&lt;/a&gt;. For configuration changes you can make in the Hadoop cluster instead, see &lt;a href=&#34;https://community.hortonworks.com/articles/16144/write-or-append-failures-in-very-small-clusters-un.html&#34;&gt;this blog post from Hortonworks&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;kerberos-authentication-when-creating-a-storage-location&#34;&gt;Kerberos authentication when creating a storage location&lt;/h2&gt;
&lt;p&gt;If HDFS uses Kerberos authentication, then the CREATE LOCATION statement authenticates using the keytab principal, not the principal of the user performing the action. If the creation fails with an authentication error, verify that you have followed the steps described in &lt;a href=&#34;../../../../en/admin/managing-storage-locations/storage-locations-on-hdfs/requirements-hdfs-storage-locations/#Kerberos&#34;&gt;Kerberos&lt;/a&gt; to configure this principal.&lt;/p&gt;
&lt;p&gt;When creating an HDFS storage location on a Hadoop cluster using Kerberos, CREATE LOCATION reports the principal being used as in the following example:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;=&amp;gt; CREATE LOCATION &amp;#39;hdfs://hadoopNS/user/dbadmin&amp;#39; ALL NODES SHARED
             USAGE &amp;#39;data&amp;#39; LABEL &amp;#39;coldstorage&amp;#39;;
NOTICE 0: Performing HDFS operations using kerberos principal [vertica/hadoop.example.com]
CREATE LOCATION
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;backup-or-restore-fails&#34;&gt;Backup or restore fails&lt;/h2&gt;
&lt;p&gt;For issues with backup/restore of HDFS storage locations, see &lt;a href=&#34;../../../../en/admin/backup-and-restore/troubleshooting-backup-and-restore/#&#34;&gt;Troubleshooting backup and restore&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
  </channel>
</rss>
