Calling RapidMiner from Python
I recently discovered that RapidMiner (RM) has a python library that lets us start an RM process in order to perform various actions. While this may appear counter-intuitive, I am interested in exploring how I might write “tests” that let me evaluate the process a student submits as a homework assignment. Put simply, if I were teaching a python-only course, I might use pytest to evaluate and autograde an assignment in order to provide immediate feedback to my students. With this library, I am hopeful that I can come up with a comparable solution for when I teach with RapidMiner.
The code snippets below are mostly adapted from the following resource, but with some additional context added. First, getting set up is easy, as it’s a simple pip install rapidminer. The repository can be found here, and is something that you should explore, as RapidMiner has been working on “operators” that allow us to easily include tooling backed by scikit-learn in python. That is, we can use scikit-learn inside a RapidMiner process.
After installing the package, I suspect that the hardest part will be the configuration to ensure that you inform the python library where RapidMiner is installed on your machine.
import rapidminer
# folder where RapidMiner Studio is installed (this is the path on my machine)
rm_home = "/Applications/RapidMiner Studio.app/Contents/Resources/RapidMiner-Studio/"
connector = rapidminer.Studio(rm_home)
The snippet above establishes the rm_home variable and points it to the appropriate folder on my MacBook. It is worth noting that, instead of being passed a string, the python library will also look for an environment variable named RAPIDMINER_HOME, so you do not have to define the location explicitly in your script.
If you dive into the python library’s source code, you will notice that Studio is looking for a folder called scripts within the installation directory.
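To make the lookup concrete, here is a minimal sketch of resolving the installation folder before creating the connector. It assumes only what is described above: an optional explicit path, the RAPIDMINER_HOME environment variable as a fallback, and a scripts folder inside the installation directory. The helper name resolve_rm_home is hypothetical and is not part of the rapidminer library.

```python
import os
from pathlib import Path


def resolve_rm_home(explicit=None):
    """Return the RapidMiner folder to use, or raise if none can be found.

    Hypothetical helper (not part of the rapidminer package): prefer an
    explicit path, otherwise fall back to the RAPIDMINER_HOME variable.
    """
    candidate = explicit or os.environ.get("RAPIDMINER_HOME")
    if not candidate:
        raise ValueError("Pass a path or set RAPIDMINER_HOME")
    home = Path(candidate)
    # The library is described as looking for a "scripts" folder in here.
    if not (home / "scripts").is_dir():
        raise FileNotFoundError(f"No scripts folder found in {home}")
    return str(home)
```

The returned string can then be handed to the connector, e.g. rapidminer.Studio(resolve_rm_home()), which fails early with a readable message if the folder is wrong.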
With the connector established, let’s run through some basic use-cases.
Save a dataset to a RapidMiner Repository
Before diving into the code, let’s get some terminology out of the way. RapidMiner has the notion of repositories. These are simply folders on your machine that will act as the container for your work. Once you have a repository set up, you can reference it via //repository, that is, two slashes followed by the repository name. You will see that in the code below.
import pandas as pd
# a public copy of the diamonds dataset
URL = "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv"
# read the csv file into a pandas DataFrame
dia = pd.read_csv(URL)
Let’s step through the above:
- We import pandas. I would recommend using a package manager like conda. In my case, I created an environment called rapidminer and pip-installed all of the tools that I need there. I also use this environment within RM when calling python.
- URL is simply a pointer to a csv file on the web.
- I use pandas to read the csv file into a DataFrame, a python object that RapidMiner can easily work with.
- Using the connector object, I am writing the dataset (called diamonds) into the data sub-folder in my repository called BU. Note the //BU shortcut I referenced above.
- You should now see the data object within the repository of your choice. TIP: You may need to refresh the view within RM Studio to see the change.
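The write step described in the notes can be sketched as follows. This assumes the connector exposes a write_resource(resource, path) method, which is my understanding of the rapidminer library and worth checking against the version you have installed. The target path //BU/data/diamonds is pieced together from the description above, and save_dataset is a hypothetical wrapper, not something the library provides.

```python
def save_dataset(connector, frame, path="//BU/data/diamonds"):
    """Write a DataFrame to a RapidMiner repository and return the path used.

    Hypothetical wrapper. It assumes `connector` is a rapidminer.Studio
    object that offers a write_resource(resource, path) method.
    """
    if not path.startswith("//"):
        raise ValueError("Repository paths are expected to start with //")
    connector.write_resource(frame, path)
    return path
```

With the objects from earlier, the call would be save_dataset(connector, dia).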
While RapidMiner also has the concept of a project (i.e. a git-backed folder), I tend to simply create repositories in RM that point to folders that are already under version control on my machine. I prefer to use other tools (e.g. CLI, GitHub Desktop) to commit/push my diffs instead of doing these tasks within RM.
Run a RapidMiner Process
Now this is where the functionality gets exciting. The code below will use the same connector, but this time it is pointed at a RapidMiner process stored in the repository.
hw = connector.run_process("//BU/hw1")
A few notes:
- The process is stored in my repository and is called hw1.
- Based on the configuration above, the entire RapidMiner process will be executed. We will see another example below where we can execute a specific operator within the process.
- Python will invoke RapidMiner and start the engine in order to execute the process. This will take a few moments, depending on your machine and installed extensions. NOTE: You can tell python whether or not you want to see the messages while RM is getting started and executing the process flow.
- Notice above that I am assigning the output of the process to a variable called hw. The results stored in this python object will depend on how the process is set up. Here, my process simply reads the diamonds dataset from the earlier example, performs an aggregation, and connects the resulting ExampleSet to the results port. As such, the hw variable is a pandas DataFrame with the aggregated results.
The key takeaway from above is that we don’t have any knowledge of the steps in the process, or the work that was done along the way, but the python library allows us to extract the result ports and bring those objects back into our python session.
You might be wondering why I used the ExampleSet reference above. An ExampleSet is simply a dataset in RapidMiner, and in most cases, in my experience, the result will be a pandas DataFrame, but it doesn’t have to be.
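Since the goal is to write pytest-style checks against whatever comes back from a process, here is a minimal sketch of what one could look like. Nothing in it is specific to RapidMiner: check_result is a hypothetical helper that only relies on the result exposing columns and a length, as a pandas DataFrame does, and the expectations in the usage comment are placeholders rather than values from my process.

```python
def check_result(result, expected_columns, expected_rows=None):
    """Return a list of problems found in a process result (empty if none).

    Hypothetical grading helper: `result` is assumed to behave like a
    pandas DataFrame, i.e. it has a `columns` attribute and supports len().
    """
    if not hasattr(result, "columns"):
        # A result port does not have to hold an ExampleSet / DataFrame.
        return [f"expected a DataFrame-like object, got {type(result).__name__}"]
    problems = []
    missing = [c for c in expected_columns if c not in list(result.columns)]
    if missing:
        problems.append(f"missing columns: {missing}")
    if expected_rows is not None and len(result) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(result)}")
    return problems


# Inside a pytest test, a check might read (placeholder expectations):
#     assert check_result(hw, ["some_column"], expected_rows=3) == []
```

Returning a list of messages, rather than raising on the first problem, is a design choice that would let an autograder report everything that is wrong with a submission in one pass.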
Call an Operator within the Process
Finally, the last task that appears to be supported is the ability to call a specific operator (i.e. a function or unit of work in RM) and get the output from that run. This is a touch tricky because I suspect that, when calling an operator this way, we must pass exactly the inputs the operator expects.
# using the dia object from earlier, only keep rows where the cut value is Good
dia2 = dia.loc[dia.cut=='Good', :]
# the Aggregate operator has two outputs, so we unpack two output objects
agg, ori = connector.run_process("//BU/hw1", inputs=dia2, operator="Aggregate")
A few notes on the above:
- I am keeping a subset of the original diamonds dataset that we grabbed from the web. This is to test the output of the Aggregate operator, which is essentially a group-by utility. In this case, there should only be one row, since the process will be grouping by the variable I filtered on.
- We are still using the run_process method, but are now specifying the inputs to use and the name of the operator within the process.
- We are also storing multiple outputs, as the Aggregate operator exports the results as an ExampleSet (which is converted to a DataFrame), as well as the original ExampleSet passed to the operator.
How neat is that?
This is a great start, and considering that RapidMiner processes are simply XML files, I do wonder if I will be able to think about ways to write tests against my students’ assignments by parsing the XML files and then running various operators along the way.
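As a rough illustration of the XML idea, the sketch below lists the operators found in a process file using only the standard library. It assumes that a process is stored as nested operator elements carrying name and class attributes, which may differ across RapidMiner versions. The SAMPLE string is a made-up, stripped-down stand-in for a real process file, and list_operators and has_operator are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified stand-in for a RapidMiner process file.
SAMPLE = """
<process>
  <operator class="process" name="Process">
    <process>
      <operator class="retrieve" name="Retrieve diamonds"/>
      <operator class="aggregate" name="Aggregate"/>
    </process>
  </operator>
</process>
"""


def list_operators(xml_text):
    """Return (name, class) pairs for every operator element, in document order."""
    root = ET.fromstring(xml_text)
    return [(op.get("name"), op.get("class")) for op in root.iter("operator")]


def has_operator(xml_text, name):
    """True if the process contains an operator with the given name."""
    return any(op_name == name for op_name, _ in list_operators(xml_text))
```

A test could first confirm that has_operator(xml_text, "Aggregate") is true for a submission, and only then call run_process with operator="Aggregate".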
This leads me to my wishlist:
- Instead of parsing the XML document, it would be nice if we could inspect the process itself, as well as get feedback on the entire run of the process. Ideally we would be able to see the data flow through, understand the changes that occur along the way, etc. Having this information in python after the call to run_process would enable us to write tests to ensure that a process is configured as expected, the output matches expectations, etc.
- In addition to running a single operator, it could be helpful to run a portion of the process, or a sequence of operators within it. The same rationale still applies: write tests against the process.