
Administer Big Data Extensions

Overview

KNIME Big Data Extensions integrate Apache Spark and the Apache Hadoop ecosystem with KNIME Analytics Platform.

This guide supports IT professionals who need to connect KNIME Analytics Platform to an existing Hadoop or Spark environment.

Complete the steps in this guide so that users of KNIME Analytics Platform can run Spark workflows. Running Spark workflows on KNIME Server requires additional steps described in the Secured Cluster Connection Guide for KNIME Server.

Overall architecture

Overall architecture of KNIME Big Data Extensions

The KNIME Extension for Apache Spark requires Apache Livy to run as a REST service on an edge or front-end node of the cluster. Review the compatibility lists below and learn how to install Livy.

Compatibility

General compatibility

The KNIME Extension for Apache Spark is compatible with:

  • Spark 2.x - 3.5
  • Livy 0.4 - 0.7

Cloudera CDP compatibility

The KNIME Extension for Apache Spark is compatible with:

Cloudera CDH compatibility

The KNIME Extension for Apache Spark is compatible with:

  • Spark 2.x on CDH 5 as provided by Cloudera CDS
  • Spark 2.x as included in CDH 6

WARNING

Cloudera CDH 5 and 6 do not include Livy. KNIME therefore provides CSDs and parcels for Livy. See the Cloudera CDH section for download and installation details.

Cloudera HDP compatibility

The KNIME Extension for Apache Spark is compatible with:

  • Spark 2.x as included in HDP 2.6.3 - 2.6.5
  • Spark 2.x as included in HDP 3.0.0 - 3.1.5

Amazon EMR compatibility

The KNIME Extension for Apache Spark is compatible with:

  • EMR 7.x with Spark 3.5 and Livy 0.7 - 0.8
  • EMR 6.x with Spark 3.x and Livy 0.6 - 0.7
  • EMR 5.9+ with Spark 2.x and Livy 0.4 - 0.7

Phase Out of Support for H2O Sparkling Water

Starting with KNIME Analytics Platform version 5.3, support for H2O Sparkling Water is being phased out. Sparkling Water support will not be updated for upcoming Spark versions, and it is no longer available on clusters with Spark 3.4 or higher (for example, Databricks). The Create Local Big Data Environment node remains available for now, but support will be removed soon.

Apache Livy Setup

Cloudera CDP

Cloudera Runtime 7.0 - 7.1 includes Spark 2.4 and Livy as a service. Cloudera provides Spark 3.x as a Custom Service Descriptor that can coexist with the included Spark version. See Installing CDS Powered by Apache Spark in the CDS 3.3, CDS 3.2, CDS 3.1, or CDS 3.0 documentation for more information. Livy in the Spark 3.x CSD uses port 28998 instead of the default 8998.
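Because the Spark 3.x CSD moves Livy to a different port, it can help to probe the REST endpoint before configuring KNIME. The following is a minimal sketch, not an official procedure. It assumes curl is installed and that Livy answers over plain HTTP without authentication (a Kerberized cluster would need extra curl options). The function names and the host `edge.example.com` are hypothetical.

```shell
# Build the URL of Livy's session list endpoint (GET /sessions).
# Arguments: host, optional port (defaults to Livy's standard 8998).
livy_sessions_url() {
  printf 'http://%s:%s/sessions\n' "$1" "${2:-8998}"
}

# Probe the endpoint. Fails on connection problems, timeouts,
# or HTTP error responses (4xx/5xx).
livy_is_up() {
  curl --silent --fail --max-time 10 --output /dev/null \
    "$(livy_sessions_url "$@")"
}

# Example with a hypothetical host: Livy from the Spark 3.x CSD on port 28998.
# livy_is_up edge.example.com 28998 && echo "Livy reachable"
```

Calling `livy_is_up` with only a host name probes the default port 8998 instead.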

TIP

If you plan to run Spark workflows on KNIME Server, review the Secured Cluster Connection Guide for KNIME Server to enable user impersonation.

Cloudera CDH

KNIME provides a CSD and parcel for Cloudera CDH so that you can install Livy as an add-on service. The current Livy version is {livy-release}.

TIP

The following steps describe how to install Livy as a managed service through Cloudera Manager using a parcel. Consult the official Cloudera parcel documentation if you need additional guidance.

Prerequisites

  • A cluster with CDH 5.8 or newer, or CDH 6.0 or newer
    • Only on CDH 5: Spark 2.2 or higher installed as an add-on service (provided by Cloudera CDS)
  • Root shell access (for example, via SSH) on the machine that hosts Cloudera Manager
  • Full administrative access to the Cloudera Manager web UI

Installation steps

In a root shell on the machine where Cloudera Manager is installed:

  1. Download a matching CSD from CSDs for Cloudera CDH to /opt/cloudera/csd/.

  2. If Cloudera Manager cannot access the public internet, copy the matching .parcel and .sha1 files from Parcels for Cloudera CDH to /opt/cloudera/parcel-repo.

  3. Restart Cloudera Manager, for example:

    systemctl restart cloudera-scm-server
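If you copied the parcel manually in step 2, it may be worth confirming that the file matches its checksum before restarting Cloudera Manager. The sketch below assumes GNU `sha1sum` is available and that the .sha1 file stores the hex digest as its first whitespace-separated field. The function name `verify_parcel` is hypothetical, and both file paths are passed explicitly because the exact file names depend on the download.

```shell
# Compare a parcel's SHA-1 digest with the digest stored in its checksum file.
# Arguments: path to the .parcel file, path to the .sha1 file.
# Returns 0 on a match, non-zero on a mismatch or an unreadable checksum file.
verify_parcel() {
  parcel_file="$1"
  sha_file="$2"
  # Assumption: the digest is the first field of the first line.
  expected=$(awk '{print $1; exit}' "$sha_file" | tr 'A-F' 'a-f')
  actual=$(sha1sum "$parcel_file" | awk '{print $1}')
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Example with placeholder paths:
# verify_parcel /opt/cloudera/parcel-repo/<file>.parcel \
#               /opt/cloudera/parcel-repo/<file>.sha1 && echo "checksum OK"
```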

In the Cloudera Manager web UI:

  1. Open the Parcel manager and locate the LIVY parcel.
  2. Download (if needed), distribute, and activate the LIVY parcel.
  3. Add the Livy service to your cluster. See the Cloudera documentation on adding services.
  4. In the HDFS service configuration, add the following settings to the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:
    • hadoop.proxyuser.livy.hosts=*
    • hadoop.proxyuser.livy.groups=*
  5. If your cluster uses HDFS Transparent Encryption, add the following settings to the Key Management Server Advanced Configuration Snippet (Safety Valve) for kms-site.xml:
    • hadoop.kms.proxyuser.livy.hosts=*
    • hadoop.kms.proxyuser.livy.groups=*
  6. If you plan to run Spark workflows on KNIME Server, review the Secured Cluster Connection Guide for KNIME Server to enable user impersonation.
  7. Restart every service affected by your configuration changes.
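The proxy-user settings from steps 4 and 5 can also be written as Hadoop XML properties. The sketch below prints them in that form, assuming you paste the result into the XML view of the corresponding safety valve. The helper names `xml_property` and `proxyuser_properties` are hypothetical. The `*` values mirror the steps above and let the livy user impersonate users from any host and any group.

```shell
# Print one Hadoop <property> element for the given name and value.
xml_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Print the proxy-user properties for a service user.
# Arguments: user (default: livy), key prefix (default: hadoop).
# Use the prefix "hadoop.kms" for kms-site.xml.
proxyuser_properties() {
  user="${1:-livy}"
  prefix="${2:-hadoop}"
  xml_property "$prefix.proxyuser.$user.hosts" '*'
  xml_property "$prefix.proxyuser.$user.groups" '*'
}

# core-site.xml fragment (step 4)
proxyuser_properties livy hadoop
# kms-site.xml fragment (step 5, only with HDFS Transparent Encryption)
proxyuser_properties livy hadoop.kms
```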

Cloudera HDP

Hortonworks Data Platform already includes compatible versions of Apache Livy and Spark 2. Review the Cloudera HDP compatibility section for supported versions. Follow the Hortonworks documentation to install Spark with the Livy for Spark2 Server component.

NOTE

The KNIME Extension for Apache Spark only supports Livy for Spark2 Server, which uses Spark 2. The Livy for Spark Server component is not supported because it relies on Spark 1.

Amazon EMR

Amazon EMR already includes compatible versions of Apache Livy and Spark. Confirm the supported versions in the Amazon EMR compatibility section and select Livy in the software configuration of your cluster.
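When you create the cluster with the AWS CLI instead of the console, the equivalent of selecting Livy in the software configuration is to list it under `--applications`. The sketch below only prints such a command so that you can review it first. The function name, the cluster name `knime-spark`, the release label `emr-6.15.0`, and the instance settings are illustrative assumptions, not recommendations.

```shell
# Print (not run) an AWS CLI command that creates an EMR cluster
# with Spark and Livy installed.
# Arguments: cluster name, EMR release label.
# Instance type and count are placeholders; adjust them to your workload.
emr_create_command() {
  printf '%s\n' "aws emr create-cluster --name $1 --release-label $2 --applications Name=Spark Name=Livy --instance-type m5.xlarge --instance-count 3 --use-default-roles"
}

emr_create_command knime-spark emr-6.15.0
```

Check the chosen release label against the Amazon EMR compatibility section before running the printed command.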

Downloads

Apache Livy downloads

CSDs for Cloudera CDH

Parcels for Cloudera CDH

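CDH 5

| Platform | Parcel | SHA |
|---|---|---|
| RHEL 7 | parcel | sha |
| RHEL 6 | parcel | sha |
| RHEL 5 | parcel | sha |
| SLES 12 | parcel | sha |
| SLES 11 | parcel | sha |
| Ubuntu 16 (Xenial) | parcel | sha |
| Ubuntu 14 (Trusty) | parcel | sha |
| Ubuntu 12 (Precise) | parcel | sha |
| Debian 8 (Jessie) | parcel | sha |
| Debian 7 (Wheezy) | parcel | sha |

CDH 6

| Platform | Parcel | SHA |
|---|---|---|
| RHEL 7 | parcel | sha |
| RHEL 6 | parcel | sha |
| RHEL 5 | parcel | sha |
| SLES 12 | parcel | sha |
| SLES 11 | parcel | sha |
| Ubuntu 16 (Xenial) | parcel | sha |
| Ubuntu 14 (Trusty) | parcel | sha |
| Ubuntu 12 (Precise) | parcel | sha |
| Debian 8 (Jessie) | parcel | sha |
| Debian 7 (Wheezy) | parcel | sha |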