Thanks to your homework, you have acquired the basic knowledge you need to go further and to use the tool properly.
As already mentioned, it is a more intricate tool than anything else studied so far, so let’s be attentive and take the time to assimilate everything.
This second and final chapter has three subsections: two focused on the theory behind the tool, and a third devoted to hands-on practice.
First, we will follow on from your homework by discovering the definition, purpose, capabilities, and usefulness of RapidMiner, in a purely theoretical way.
Second, we will see how everything highlighted in the first subsection can be extremely helpful, even essential, for the future journalists that you are.
Third, the last subsection, and the last topic of the whole course, will focus on the two hands-on parts: how to get your hands on the tool yourself, become familiar with it in practice, and analyze documents and data with it, so that you are finally ready to use it in your future professional tasks.
The first hands-on will focus on discovering the tool and getting familiar with the basics; the second will be more challenging and will apply several kinds of processing.
Let’s start!
- Definition, purpose, capabilities of RapidMiner
As you have certainly figured out from your own research for the homework, RapidMiner is data science software, created in 2006, that enables data analysis, text analysis, machine learning, predictive analytics, and much more. It is a tool that lets you carry out analyses that can sometimes be very technical, but without having to write code or go through inaccessible, intricate processing.
As one tutorial puts it, RapidMiner is a “graphical tool that will help us conveniently hide the complexity of writing code and speed up our learning process”. Far from being out of reach, it thereby becomes an approachable tool.
It is software used “for business and commercial applications as well as for research, education, training, rapid prototyping, and application development and supports all steps of the machine learning process including data preparation, results visualization, model validation and optimization”.
You can watch its introductory video, https://www.youtube.com/watch?v=Gg01mmR3j-g, to familiarize yourself with its main capabilities and to understand, broadly speaking, a little of how it works.
So far, the idea was just to introduce RapidMiner and its capabilities; more important still is how RapidMiner is relevant for future journalists like you. Let’s move on to the next subsection to discover the answer.
- How useful RapidMiner would be for future journalists
You are future journalists: reading articles, writing articles, being immersed in the world of words, and continually needing to compare many articles, genuine ones as well as fake news, in an insatiable search for the truth will be the very core of your future job.
Everything studied since the beginning of the course has aimed to make you realize the role you will play in carrying out your tasks and the responsibility behind words, which can otherwise be misunderstood or misinterpreted.
RapidMiner is software that lets you question the choices behind your words and bring out a manner of writing: its repetitions, expressions, punctuation, and so on. It highlights details that you might never have dwelled on otherwise, by pointing out what computer analysis often catches and our minds fail to grasp.
Useful for future journalists like you, isn’t it?
The purpose of RapidMiner is either to improve your writing skills, by running articles you have written through it, or to improve your ability to analyze articles by others: when you want to prove that one of them is fake news, for example, or to grasp its semantics, in order to figure out what it conveys tacitly and implicitly.
Useful for future journalists like you, isn’t it?
Journalism is such a huge field, and anyone can claim to be a journalist, or at least someone legitimate enough to write and share about any subject. It is therefore understandable to feel overwhelmed by an excess of information, to want to draw a path, your own path, as consistent as possible with your values, and to stay aware of others’ work, to learn from it and become the analyst you should also be of your own work.
You will work out the ins and outs of RapidMiner much better by putting this theory into practice very soon, in the next and final subsection, right below.
It is a pure hands-on, practical work in which all the theory acquired will finally find its full meaning.
- Familiarizing yourself with the software: analyzing, and so on
You definitely have a better insight into what RapidMiner is, don’t you?
The final idea, now, is to familiarize yourself with the tool, beyond a mere overview.
To start, you can download the software for free; choose the commercial version, not the educational version.
Attached to this course is another document of 30 pages; it gathers all the notes from two tutorials followed about RapidMiner and its capabilities. It consists mostly of concrete examples of what you can do, with precise explanations of the process to follow: importing the data, tokenizing words, classifying, extracting entities, stemming, and so on.
What you need to know about RapidMiner is the basic process you will always use, even for more complex processing later on. To that end, below is a summary of what is important to know.
Please open at the same time the PowerPoint for the two hands-on sessions and the associated Quiz (the first part of the Quiz still relates to the previous chapter; the second part focuses on RapidMiner).
Some explanations from this 30-page PDF are copied here to serve as a kind of glossary of keywords and important points:
Operator:
A small task that performs its own function.
Search: you can find operators by keyword, for example.
Operators are very well organized and categorized.
Extensions: many are available. Some are free, others are paid.
Repository:
Used once you are done with your data science process and want to deploy it on the cloud to share it, so that others can access your work.
Parameters:
Properties of the operators you use, of the process you are creating, or of the entire process itself (for example, where to save the log file).
When you select an operator, the parameters of that operator are shown.
Help:
Very useful: when you select an operator, it shows a short description of what the operator does, details about its various parameters, their default values, and so on.
For most operators, it also gives you a link to a tutorial process.
The blue button in the top-left executes the process.
The button right next to it interrupts the process.
You also have two possible views in RapidMiner:
- The design view: where you develop your process and build your project
- The results view: where the results are shown once you run the process
Total occurrences: the number of times the word occurs across all of your texts or documents.
Document occurrences: the number of documents in which the word occurs at least once.
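To make the difference between the two counts concrete, here is a minimal sketch in Python. It is not RapidMiner code (RapidMiner is a graphical tool); the three sample documents and the variable names are made up for illustration, and the texts are already in lowercase with no punctuation so that a simple split on spaces is enough.

```python
from collections import Counter

# Hypothetical sample documents, invented for this illustration.
documents = [
    "free offer and free shipping",
    "the offer ends today",
    "nothing is free",
]

total_occurrences = Counter()     # how many times each word appears overall
document_occurrences = Counter()  # in how many documents each word appears

for text in documents:
    words = text.split()
    total_occurrences.update(words)
    # set() keeps each word once per document, so a word repeated
    # inside the same document is only counted one time here.
    document_occurrences.update(set(words))

print(total_occurrences["free"])     # 3
print(document_occurrences["free"])  # 2
```

The word "free" appears three times in total, but in only two of the three documents, because the first document uses it twice.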
A little theory about preprocessing:
- Transform cases: if we have “FREE”, “Free”, and “free” and we do not transform them, they will be treated as three different words during processing. That is why we have to convert all of our words to lowercase when processing the document, to be sure they are counted as one.
- Stemming: suppose we have the two words “organizing” and “organized” in our text; they will be recognized as two different words. But both come from the same root word, “organize”, and because they share that root we want to count them as the same word. Stemming is the process that allows this: it reduces each word to its root so that they can be counted as the same word.
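The effect of these two preprocessing steps on word counts can be sketched in a few lines of Python. This is only an illustration, not what RapidMiner runs internally: the function `toy_stem` and its short suffix list are assumptions invented here, far simpler than a real stemming algorithm.

```python
from collections import Counter

# Hypothetical, deliberately naive suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "e")

def toy_stem(word, min_stem=4):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["FREE", "Free", "free", "organizing", "organized"]

# Without preprocessing, every spelling is its own word.
raw_counts = Counter(words)

# Transform cases, then stem, then count again.
lowered = [w.lower() for w in words]
stemmed_counts = Counter(toy_stem(w) for w in lowered)

print(len(raw_counts))       # 5 different "words" before preprocessing
print(dict(stemmed_counts))  # {'free': 3, 'organiz': 2}
```

Note that the stem "organiz" is not a dictionary word. That is normal for stemming: what matters is that both forms end up under the same label, so they are counted together.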
Each time, it starts with loading the data, text, or other material you want to analyze.
And then:
- Always find the operator you want to use, according to what you want to do.
=> It can simply be “read document”, to load your document.
=> “tokenize”: if you want to break your text into individual words.
=> “extract token number”: once tokenize has split the text into tokens, it counts how many tokens the document contains.
=> “extract length”: to extract the length of the document; punctuation is removed earlier, when “tokenize” splits the text on non-letter characters.
=> “transform cases”: studied just above; it converts everything to lower case.
=> “stem”: likewise studied above; it reduces words sharing the same root to a single form.
- Choose in the parameters the properties you want to give your operator.
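The idea of chaining operators and setting their parameters can be sketched in Python as follows. This is an analogy rather than RapidMiner itself: each function stands in for an operator, its keyword arguments stand in for the parameters panel, and all names, defaults, and the sample sentence are assumptions made for the illustration.

```python
import re
from collections import Counter
from functools import partial

def tokenize(text, pattern=r"[^A-Za-z]+"):
    """Split the text on non-letter characters, which also drops punctuation."""
    return [token for token in re.split(pattern, text) if token]

def transform_cases(tokens, mode="lower"):
    """Convert every token to lower case (or upper case if mode is 'upper')."""
    return [t.lower() if mode == "lower" else t.upper() for t in tokens]

def stem(tokens, suffixes=("ing", "ed", "e"), min_stem=4):
    """Naive stand-in for a stemmer: strip the first matching suffix."""
    stems = []
    for token in tokens:
        for suffix in suffixes:
            if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems

# The "process": operators in order, each with its parameters chosen up front.
process = [
    tokenize,
    partial(transform_cases, mode="lower"),
    partial(stem, min_stem=4),
]

def run(process, document):
    """Feed the output of each operator into the next one."""
    data = document
    for operator in process:
        data = operator(data)
    return data

# Hypothetical input standing in for the document you would load.
document = "Organizing is hard. We organized it, and FREE means free!"
result = run(process, document)
counts = Counter(result)

print(result)
print(counts["organiz"], counts["free"])  # 2 2
```

Changing a parameter, for example `mode="upper"`, changes the result without touching the rest of the chain, which is the same logic as adjusting an operator in the parameters panel.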
For this part, there is not much more to specify here in the PDF document of the course: it is a practical exercise, which calls for guidance throughout each process, and that guidance is provided by the two PowerPoint presentations to follow at the same time.