Dataproc services

This page lists the services that run on Dataproc cluster nodes, along with the image versions in which each service runs.

All nodes

The following services run on all nodes in a cluster.

| Node type | Service | Image versions | Description |
|---|---|---|---|
| All nodes | google-dataproc-agent | all | Receives jobs from Dataproc and launches job drivers |
| All nodes | google-fluentd | all | Collects and pushes logs to Logging |

Standard clusters

The following services run on standard clusters.

| Node type | Service | Image versions | Description |
|---|---|---|---|
| Master | hadoop-hdfs-namenode | all | Manages the HDFS filesystem |
| Master | hadoop-hdfs-secondarynamenode | all | Checkpoints the NameNode |
| Master | hadoop-mapreduce-historyserver | all | Serves MapReduce application history information |
| Master | hadoop-yarn-resourcemanager | all | Schedules and manages YARN applications |
| Master | hadoop-yarn-timelineserver | 1.3+ | Serves YARN application history information |
| Master | hive-metastore | all | Manages Hive table metadata. By default, uses the local mariadb (image versions < 1.5) or mysql (image versions 1.5+) database on the master node as the Hive table metadata store. Using the default database is not recommended because it is tied to the cluster's lifecycle. Instead, use one of the following as the Hive metastore database, in order of recommendation: (1) Dataproc Metastore, (2) Cloud SQL instance. |
| Master | hive-server2 | all | Serves queries received from clients (primarily beeline shell queries) against Hive |
| Master | mariadb | < 1.5 | A relational database used as the default underlying database for Hive metastore in Dataproc < 1.5 images |
| Master | mysql | 1.5+ | A relational database used as the default underlying database for Hive metastore in Dataproc 1.5+ images |
| Master | nfs-kernel-server | < 1.3 | NFS is the Network File System. |
| Master | spark-history-server | all | Serves Spark application history information |
| All workers | hadoop-yarn-nodemanager | all | Launches and manages YARN containers |
| Primary workers only | hadoop-hdfs-datanode | all | Stores HDFS blocks |
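
The "Image versions" column uses three forms: "all", a minimum version such as "1.3+", and an upper bound such as "< 1.5". The following Python sketch shows one way to read those forms and list the services expected on a standard cluster node. It is an illustration only, not part of any Dataproc API: the role labels ("master", "primary-worker", "secondary-worker") and the function names are hypothetical, and version strings are assumed to start with "major.minor".

```python
from typing import List, Tuple

# (node type, service, image versions) rows copied from the tables above.
SERVICE_ROWS: List[Tuple[str, str, str]] = [
    ("all nodes", "google-dataproc-agent", "all"),
    ("all nodes", "google-fluentd", "all"),
    ("master", "hadoop-hdfs-namenode", "all"),
    ("master", "hadoop-hdfs-secondarynamenode", "all"),
    ("master", "hadoop-mapreduce-historyserver", "all"),
    ("master", "hadoop-yarn-resourcemanager", "all"),
    ("master", "hadoop-yarn-timelineserver", "1.3+"),
    ("master", "hive-metastore", "all"),
    ("master", "hive-server2", "all"),
    ("master", "mariadb", "< 1.5"),
    ("master", "mysql", "1.5+"),
    ("master", "nfs-kernel-server", "< 1.3"),
    ("master", "spark-history-server", "all"),
    ("all workers", "hadoop-yarn-nodemanager", "all"),
    ("primary workers only", "hadoop-hdfs-datanode", "all"),
]

# Hypothetical role labels mapped to the node types used in the tables.
ROLE_NODE_TYPES = {
    "master": {"all nodes", "master"},
    "primary-worker": {"all nodes", "all workers", "primary workers only"},
    "secondary-worker": {"all nodes", "all workers"},
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Reduce a version such as '1.5' or '2.1-debian11' to (major, minor)."""
    core = text.strip().split("-")[0]
    return tuple(int(part) for part in core.split(".")[:2])


def matches(spec: str, image_version: str) -> bool:
    """Return True if image_version satisfies an 'Image versions' entry."""
    spec = spec.strip()
    version = parse_version(image_version)
    if spec == "all":
        return True
    if spec.endswith("+"):  # for example "1.3+": this version or later
        return version >= parse_version(spec[:-1])
    if spec.startswith("<"):  # for example "< 1.5": earlier than this version
        return version < parse_version(spec[1:])
    raise ValueError(f"unrecognized image version entry: {spec!r}")


def expected_services(role: str, image_version: str) -> List[str]:
    """List the services the tables assign to a role, in table order."""
    node_types = ROLE_NODE_TYPES[role]
    return [
        service
        for node_type, service, spec in SERVICE_ROWS
        if node_type in node_types and matches(spec, image_version)
    ]
```

For example, under these assumptions `expected_services("master", "1.4")` includes mariadb and hadoop-yarn-timelineserver but not mysql or nfs-kernel-server. Versions are compared as integer tuples so that a minor version of 10 sorts after 5.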

HA clusters

In Dataproc High Availability (HA) clusters, different services run on different master nodes, as shown below. HA cluster worker node services are the same as those listed for standard clusters.

| Node type | Service | Image versions | Description |
|---|---|---|---|
| All masters | hadoop-hdfs-journalnode | all | A quorum of journal nodes maintains an edit log of HDFS namespace modifications. If a failover occurs, the Standby NameNode reads the edit log and takes control from the Active NameNode. |
| All masters | hadoop-yarn-resourcemanager | all | Schedules and manages YARN applications |
| All masters | hive-metastore | all | Manages Hive table metadata. By default, uses the local mariadb (image versions < 1.5) or mysql (image versions 1.5+) database on the master node as the Hive table metadata store. Using the default database is not recommended because it is tied to the cluster's lifecycle. Instead, use one of the following as the Hive metastore database, in order of recommendation: (1) Dataproc Metastore, (2) Cloud SQL instance. |
| All masters | hive-server2 | all | Serves queries received from clients (primarily beeline shell queries) against Hive |
| All masters | zookeeper-server | all | A ZooKeeper quorum is used for distributed coordination. In High Availability (HA) clusters, it is used for leader election among HDFS NameNodes and YARN resource managers. |
| Masters 0 and 1 only | hadoop-hdfs-namenode | all | Manages the HDFS filesystem |
| Masters 0 and 1 only | hadoop-hdfs-zkfc | all | ZKFC is the ZKFailoverController process, which runs with the HDFS NameNode. It monitors the health of the NameNode and manages leader election via ZooKeeper in the event of a failover. |
| Master 0 only | hadoop-mapreduce-historyserver | all | Serves MapReduce application history information |
| Master 0 only | hadoop-yarn-timelineserver | 1.3+ | Serves YARN application history information |
| Master 0 only | mariadb | < 1.5 | A relational database used as the default underlying database for Hive metastore in Dataproc < 1.5 images |
| Master 0 only | mysql | 1.5+ | A relational database used as the default underlying database for Hive metastore in Dataproc 1.5+ images |
| Master 0 only | nfs-kernel-server | < 1.3 | NFS is the Network File System. |
| Master 0 only | spark-history-server | all | Serves Spark application history information |
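
The placement of services across HA masters can also be read as a lookup from master index to services. The Python sketch below encodes only the rows listed for all image versions, so it leaves out hadoop-yarn-timelineserver, mariadb, mysql, and nfs-kernel-server. It assumes three masters indexed 0 to 2. The names `MASTER_COUNT`, `HA_MASTER_SERVICES`, and `services_on_master` are hypothetical and do not come from any Dataproc API.

```python
from typing import Dict, List

# Assumption: an HA cluster has three masters, indexed 0, 1, and 2.
MASTER_COUNT = 3

# Service name -> master indices that run it, per the HA table above.
# Only services listed for all image versions are included.
HA_MASTER_SERVICES: Dict[str, List[int]] = {
    "hadoop-hdfs-journalnode": [0, 1, 2],
    "hadoop-yarn-resourcemanager": [0, 1, 2],
    "hive-metastore": [0, 1, 2],
    "hive-server2": [0, 1, 2],
    "zookeeper-server": [0, 1, 2],
    "hadoop-hdfs-namenode": [0, 1],
    "hadoop-hdfs-zkfc": [0, 1],
    "hadoop-mapreduce-historyserver": [0],
    "spark-history-server": [0],
}


def services_on_master(index: int) -> List[str]:
    """Return the sorted service names the table assigns to one master."""
    if index not in range(MASTER_COUNT):
        raise ValueError(f"master index must be 0..{MASTER_COUNT - 1}")
    return sorted(
        name for name, indices in HA_MASTER_SERVICES.items() if index in indices
    )
```

Under these assumptions, `services_on_master(2)` returns only the five services that run on all masters, while masters 0 and 1 additionally run the NameNode and ZKFC.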