What is KNIME Software?
One enterprise-grade software platform, two complementary tools: the open source KNIME Analytics Platform for creating data science, and the commercial KNIME Hub for productionizing data science.
If you are new to KNIME, you can download KNIME Analytics Platform via your internal catalog or directly on the KNIME website. We suggest you look at our guide to getting started to build your first workflow.
![image1](./img/image1.png)
Workflow design process: From KNIME Analytics Platform to KNIME Hub
As mentioned before, these are complementary tools.
KNIME Analytics Platform is our open source software for creating data science. Intuitive, open, and continuously integrating new developments, it makes understanding data and designing workflows and reusable components accessible to everyone.
KNIME Hub is an enterprise software for team-based collaboration, automation, management, and deployment of data science workflows as Data Apps, Services, and more.
As you start your data project with KNIME, you will create a workflow that can then be uploaded to KNIME Hub. KNIME Analytics Platform is how you’ll design your data process via a workflow. Once your workflow is ready, it can be easily automated or deployed when you upload it to KNIME Hub.
Before building a KNIME Workflow: project prerequisites
As with any project, prior to building your workflow, it’s important to understand the scope of what you’re working with. This checklist can guide you:
- Is there a clearly defined goal that is measurable and achievable? For example: we want to reduce the churn rate by 15%.
- What data sources are required to reach the goal? Do you have access to these sources?
- Who should consume your deployed data science application, and how? For instance, are you making:
  - A scheduled workflow that regularly creates and sends a report?
  - A scheduled workflow that gets executed regularly to process new data or apply a predictive model?
  - A Data App that gets deployed to KNIME WebPortal?
  - A predictive model that is accessible via REST API?
- What kind of workflows do you need to achieve your goal?
  - What is the task of each individual workflow that you want to build?
  - What is the input and output of each individual workflow?
  - Do your workflows share some parts?
- Define the requirements for the created workflows (e.g. how fast the execution should be, how often it should be executed).
- What hardware and software do you need to meet all these requirements?
Best practices when working with KNIME Analytics Platform
Use proper naming for your workflows or groups
Use a clear naming convention for your workflows. From the workflow name, it should be clear what the workflow is doing (e.g. “Project_1” vs. “Read and Preprocess Customer Data”). If one of your workflows is followed by another workflow, you might want to introduce an order by using numbers as prefixes.
![image3](./img/image3.png)
Design your workflow in a secure, reusable, and efficient way
Avoid using local file paths
To make your workflow able to access its data sources whether it is executed on your machine or on any other machine, it is important to avoid local file paths. Instead, use relative paths or, if your data is already available in a remote file system, manage your files and folders there directly. To do so, you can use a dynamic file system connection port and one of the many different connector nodes, e.g. the SMB Connector or the S3 Connector node.
Use credential nodes
It is not recommended that you save any credentials or confidential data in your nodes and workflows. Instead use the Credentials Configuration or Credentials Widget node to let the user specify credentials at runtime.
![image5](./img/image5.png)
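The same principle applies to any script or application: credentials should be supplied at runtime, never stored in the artifact itself. A minimal Python sketch of the idea, assuming illustrative environment variable names (`DB_USER`, `DB_PASSWORD` are not KNIME conventions):

```python
import os

def get_db_credentials():
    """Fetch credentials at runtime instead of hardcoding them.

    Reads from environment variables set by the user or the deployment
    environment, mirroring how the Credentials Configuration node asks
    for credentials at execution time. The variable names are illustrative.
    """
    user = os.environ.get("DB_USER")
    password = os.environ.get("DB_PASSWORD")
    if user is None or password is None:
        raise RuntimeError("Credentials must be provided at runtime")
    return user, password

# The credentials live in the environment, never in the script itself.
os.environ["DB_USER"] = "analyst"
os.environ["DB_PASSWORD"] = "s3cret"
user, password = get_db_credentials()
```

This way, the workflow or script can be shared and versioned without ever exposing confidential data.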
Document your workflow
To make your workflow reusable, you and your team need to be able to quickly understand what it’s doing. It is therefore important to document your workflow — and if it’s large, you should structure it into different parts.
To structure a large workflow, you can use metanodes and components for the different parts and nest them. To document the workflows, you can:
- Change the node labels to describe what individual nodes do by double-clicking on their labels.
- Create an annotation box: right-click anywhere in the workflow editor and select "New Workflow Annotation." Then type in your explanatory comment, resize the annotation window to make clear which group of nodes it refers to, and format the text using the context menu. This workflow is an example of how to document workflows and add descriptions and annotations to them.
- Attach a workflow description by clicking on an empty spot in the workflow, then going into the Node Description view and clicking the edit button on the top-right.
Design your workflow to be efficient
To make the execution of your workflow efficient, it is good to follow some best practices and take into consideration how compute-intensive it is to execute each individual node.
- Exclude superfluous columns before reading. If not all columns of a dataset are actually used, you can avoid reading them by excluding them via the "Transformation" tab of the reader node.
- Avoid reading the same file multiple times; instead, connect one reader node to multiple downstream nodes.
- Use the "Files in Folder" option to read multiple files with the same structure.
- Remove redundant rows and columns early, before performing expensive operations like joining or data aggregation.
- Only use loops if absolutely necessary. Executing loops is normally expensive; try to avoid them by using other nodes, e.g. the String Manipulation (Multi Column) or Math Formula (Multi Column) nodes.
- Push as much computation as possible to the database. If you are working with databases, you can speed up workflow execution by pushing as much of the data preprocessing as possible to the database server, using the DB preprocessing nodes before reading the data into KNIME Analytics Platform.
- Close database connections with the DB Connection Closer node.
- Delete disconnected nodes or workflow branches that are not actively needed.
To further improve the performance of your workflow, you can use the Timer Info node to find out how long it takes to execute individual nodes.
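The "push computation to the database" tip can be illustrated outside KNIME with a small Python sketch using an in-memory SQLite database (the table and column names are made up for the example):

```python
import sqlite3

# The filtering and aggregation run inside the database engine, so only
# the small result set crosses the wire, not the whole table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 100.0), ("EU", 250.0), ("US", 80.0), ("US", 120.0)],
)

# Inefficient: fetch every row, then aggregate on the client side.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals_client = {}
for region, amount in rows:
    totals_client[region] = totals_client.get(region, 0.0) + amount

# Efficient: let the database aggregate; only one row per region returns.
totals_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

assert totals_client == totals_db
print(sorted(totals_db.items()))  # [('EU', 350.0), ('US', 200.0)]
```

The DB preprocessing nodes in KNIME generate SQL along these lines, so the heavy lifting happens on the database server rather than on your machine.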
Best practices when working with KNIME Hub
Versioning
When working in a team, building and modifying workflows together, it is easy to lose progress by overwriting a colleague's work. Sometimes you also make changes that you later discover to be wrong, and you will want to roll back your work to a previous version. When writing code, you usually use a version control system like Git for this task. Such a system boosts productivity by tracking changes and ensuring that no important work is lost.
Read about how to use versioning on KNIME Hub:
Versions
KNIME Hub provides its own version control mechanism.
When you upload a workflow to KNIME Hub, it is initially a draft. To execute the workflow or create deployments, you need to create a version of it.
Creating a deployment from a version allows you and the other team members to continue working on the workflow without disrupting the deployment.
It is important to give each version you create a meaningful name and to provide a description.
At any point in time, you can access the versions of a workflow or component on KNIME Hub and restore old versions.
How to work as a team?
Working on the same project using components
When working on a larger project, you might want multiple team members to work on the same project and/or workflow at the same time. To do so, you can split the overall workflow into multiple parts, where each part is implemented within a component, for which you define the task/scope as well as the expected input and output columns.
For example, if you work on a prediction project (e.g. churn prediction
or lead scoring) where you have multiple data sources (e.g. an activity
protocol, interaction on your website, email data, or contract
information), you might want to generate some features based on each
data table. In this case, you can split the preprocessing into multiple
components with a clear description — which features should be created
based on an input dataset, and which output is expected (“ID Column
generated features,” or similar).
Then each team member can work on one of these tasks individually, encapsulate the work into a component, and share it with the team. In the end, the different steps can be combined into one workflow that uses the shared components in the right order.
Another nice side effect is that the individual components can be reused by other workflows. So if you have shared components that prepare your customer data for a churn prediction model, you can reuse them when working on a next best step model.
Workflow services (call a workflow from another workflow)
A newly introduced feature in KNIME Analytics Platform is the so-called workflow service. Like a component, a workflow service has clearly defined inputs and outputs. Unlike a component, however, a workflow service stands alone, and each input and output is defined by placing a Workflow Service Input or Workflow Service Output node into the workflow. The nodes have dynamic ports, so you can change their type by clicking on the three dots inside the node body. To make use of a workflow service, use a Call Workflow Service node. Once you point it to a workflow that has Workflow Service Input and Output nodes, the Call Workflow Service node adapts to the present inputs and outputs and adjusts its own ports correspondingly. Now you just need to connect it, and when you run it, the workflow it points to will be executed. Data from your workflow is transferred to the callee workflow, which in turn sends back its results.
![image20](./img/image20.png)
Workflow Services vs. Components
So when should you use workflow services, and when should you use components? A component is copied into a workflow when it is added: all the nodes the component is composed of are then inside your workflow with their own configuration, and deleting the component from the repository does not change the behavior of your workflow at all. A workflow service, on the other hand, is merely referenced: the Call Workflow Service node tells the workflow what to call, but the exact nodes that will be executed are hidden within the workflow service. This has advantages and disadvantages: your workflow will be smaller because it contains fewer nodes, but the publisher of the workflow service can change the implementation without you noticing at all. When using shared components, you can decide whether or not to update them if something has changed in the referenced component. This is not possible with workflow services.
A component is also executed as part of the workflow that contains it. A workflow service, on the other hand, is executed where it is located: a workflow service in your local repository will be executed locally, but if it resides in a KNIME Business Hub repository, that is where the service will run. This means you can offload work from your local workflow to a Hub executor, or trigger multiple workflow services at the same time; if the server has distributed executors for running workflows in parallel, those workflow services can process your data very quickly.
Workflow Services vs. Call Workflow and Container In- and Outputs
In addition to workflow services, the Call Workflow and various Container Input and Output nodes have been around for a while. The difference between those and the new workflow services is that the latter are meant for usage within KNIME, while the former can be called from third-party applications. So why not use Call Workflow for everything? Those nodes transfer data by first converting it into JSON format, a text-based format that cannot deal with all KNIME data types and is rather large. Workflow services, on the other hand, use the proprietary binary format KNIME uses internally to represent its data. This means no conversion is necessary and data size is very small. This leads to faster execution at the expense of shutting third-party applications out, as they do not understand the data formats.
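The size difference between a text-based and a binary encoding is easy to demonstrate. The sketch below uses Python's `struct` module as a stand-in for "a binary format" (KNIME's internal format is proprietary, so this is only an illustration of the general effect):

```python
import json
import struct

# A column of 1000 floating-point numbers, as a workflow might transfer.
values = [i * 0.123456789 for i in range(1000)]

# Text-based encoding: every double becomes a long decimal string.
as_json = json.dumps(values).encode("utf-8")

# Binary encoding: exactly 8 bytes per IEEE 754 double, no conversion loss.
as_binary = struct.pack(f"{len(values)}d", *values)

assert len(as_binary) == 8 * len(values)
assert len(as_json) > len(as_binary)
```

On top of the size overhead, the JSON round trip must parse and re-serialize every value, whereas a binary format can be handed over essentially as-is, which is where the faster execution comes from.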
Use Cases for Workflow Services
Workflow services are always a good idea when you want to split a project up into smaller reusable workflows. Using many components inside a workflow can make it rather large, resulting in slow workflow loading. Putting the logic into workflows that are called instead can decrease the workflow size, and therefore the loading time, considerably.
A particular use case for workflow services is giving other users of KNIME Analytics Platform the ability to offload compute-intensive tasks to a KNIME Hub. If someone builds a workflow locally and would like to train, for example, a deep learning model but does not have a powerful GPU to do so efficiently, they can send their data to a workflow service that receives the data and hyperparameters as input, trains the model, and outputs the final result.
But workflow services can also be used to provide access to additional data sources. Data only available from the Hub may be exposed to clients by a workflow service. Because the service can do checks and transformations within the workflow, it is easy to build it in a way that does not send out any confidential data. You could, for example, anonymize data before giving it to clients, or you could filter out certain rows or columns. The workflow service essentially acts as an abstraction layer between the actual data source and the client.
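Inside KNIME you would implement such checks with nodes before the Workflow Service Output; the idea itself can be sketched in a few lines of Python (the column names and the salt below are made up for the example):

```python
import hashlib

def anonymize_rows(rows, id_column):
    """Replace a sensitive ID column with a one-way hash before the
    data leaves the service, so clients can still join on the column
    without learning the real identifiers.
    """
    salt = b"per-deployment-secret"  # stays inside the service, never output
    out = []
    for row in rows:
        row = dict(row)  # do not mutate the caller's data
        digest = hashlib.sha256(salt + row[id_column].encode()).hexdigest()
        row[id_column] = digest[:12]  # short, stable pseudonym
        out.append(row)
    return out

rows = [{"customer_id": "C-1001", "churn_score": 0.83}]
result = anonymize_rows(rows, "customer_id")
```

Because the hash is salted with a secret that never leaves the service, clients cannot reverse the pseudonyms back to the original IDs.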
Glossary
Workflow annotation
A box with colored borders and optional text that can be placed in a workflow to highlight an area and provide additional information. Right-click on an empty spot in a workflow and select “New workflow annotation” to create a new workflow annotation at that position.
Node label
The text underneath a node. Select the node and click on the text. Then you can edit it and add a description of what the node does.
Workflow description
A description of a workflow that can be edited by the user. Click on the pencil icon at the top-right to edit it. The workflow description is also shown in the KNIME Hub when a user opens the workflow’s page.
Job
Every time a workflow is executed ad hoc or a deployment is executed, a job is created on KNIME Hub. More information about jobs can be found here.
Data app deployment
A workflow comprising one or more components that contain view and widget nodes a user can interact with in the browser. A Data App is deployed on KNIME Business Hub. A guide on building Data Apps can be found here.
Schedule deployment
A deployment that executes a workflow automatically at a defined time or interval, e.g. to regularly process new data or create and send a report.
Workflow service
A workflow with Workflow Service Input and Workflow Service Output nodes that can be called from other workflows using the Call Workflow Service node.