How to Clean the Twitter Data using R – Twitter Mining Tutorial

In the previous post, I showed how to fetch Twitter data using R. Before mining any kind of data, we need to clean it so that mining techniques can be applied to it. This tutorial cleans tweet text with a few of R's built-in functions, mainly gsub(), tolower() and sapply().

For the demonstration, I have created a function named clean_text() that takes the tweet text and processes it.


Code of the clean_text() function

clean_text = function(x) {
  x = gsub("rt", "", x)           # remove "rt" (case-sensitive: matches lowercase "rt" anywhere, not "RT")
  x = gsub("@\\w+", "", x)        # remove @mentions
  x = gsub("[[:punct:]]", "", x)  # remove punctuation
  x = gsub("[[:digit:]]", "", x)  # remove numbers/digits
  x = gsub("http\\w+", "", x)     # remove links (punctuation is already gone, so a link is one "http..." word)
  x = gsub("[ |\t]{2,}", "", x)   # delete runs of two or more spaces or tabs
  x = gsub("^ ", "", x)           # remove a blank space at the beginning
  x = gsub(" $", "", x)           # remove a blank space at the end
  # convert the text to lowercase; return NA for text that tolower() cannot handle
  try.error = function(z) {
    y = NA
    try_error = tryCatch(tolower(z), error = function(e) e)
    if (!inherits(try_error, "error"))
      y = tolower(z)
    return(y)
  }
  x = sapply(x, try.error)
  return(x)
}
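
The lowercase step wraps tolower() in tryCatch() because tolower() can raise an error on text containing byte sequences that are invalid in the current encoding, and a single such tweet would otherwise abort the whole sapply() call. Below is a minimal sketch of the same pattern in a shorter form. The name safe_lower is hypothetical and not part of the original function, and whether a particular string triggers the error depends on your locale and R version.

```r
# Hypothetical helper: lowercase one string, or return NA if tolower() fails.
safe_lower <- function(z) {
  tryCatch(tolower(z), error = function(e) NA_character_)
}

tweets <- c("Who Will Win?", "IPL Final")

# vapply() guarantees a character result; USE.NAMES = FALSE keeps the
# original strings from being attached as names, which sapply() does by default.
lowered <- vapply(tweets, safe_lower, character(1), USE.NAMES = FALSE)
print(lowered)
```

If you keep sapply() as in the function above, the cleaned vector carries the original text as element names; unname() removes them if they get in the way.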

To run the code, you need to set up a Twitter connection and fetch some tweets. Below is the code where the clean_text() function is called for text processing:

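data1 <- searchTwitter("IPL", n=1000)  # fetch up to 1000 tweets matching "IPL"
data1.text <- sapply(data1, function(x) x$getText())  # extract the text of each tweet
data1.text <- clean_text(data1.text)  # clean the extracted text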

To see the difference between the tweet data before and after processing, compare the following:

Tweet Pre Processing:

[1] "twittFliy: RT @SirJadeja: #MI Won The Toss &amp; Elected To Bowl First. Who Will Win?\n\n#DDvMI #MIvDD #IPL2016 #IPL"

Same Tweet Post Processing:

rtmi won the toss amp elected to bowl first who will win\n\nddvmi mivdd ipl ipl
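
To see where the fused "rtmi", the leftover "amp" and the surviving "\n\n" come from, it can help to apply the substitutions one at a time. The sketch below replays the steps that change this particular tweet. The string is typed in by hand from the example above, starting at "RT", so treat it as an illustration rather than captured output.

```r
tweet <- "RT @SirJadeja: #MI Won The Toss &amp; Elected To Bowl First. Who Will Win?\n\n#DDvMI #MIvDD #IPL2016 #IPL"

step1 <- gsub("@\\w+", "", tweet)        # drops "@SirJadeja"
step2 <- gsub("[[:punct:]]", "", step1)  # drops ":", "#", "&", ";", "." and "?"
step3 <- gsub("[[:digit:]]", "", step2)  # "IPL2016" becomes "IPL"
step4 <- gsub("[ |\t]{2,}", "", step3)   # the two spaces left after "RT" are deleted, fusing "RT" and "MI"
step5 <- tolower(step4)                  # "RTMI" becomes "rtmi"

print(step5)
```

The newlines survive because the whitespace pattern only lists spaces, tabs and "|", and "amp" is what remains of the HTML entity &amp; once its punctuation is stripped. The uppercase "RT" is untouched by the first substitution in clean_text(), which looks for lowercase "rt" before the text has been lowercased.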

You can modify the cleaning function to process the text to suit your needs.
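
As one example of such a modification, the sketch below removes the retweet marker only when it stands alone as a word, strips links before punctuation is removed, drops the &amp; entity, and collapses all whitespace (including newlines) to a single space. The name clean_text_v2 and the specific choices are assumptions for illustration, not the original function.

```r
# Hypothetical variant of the cleaning function; adjust the steps to your data.
clean_text_v2 <- function(x) {
  x <- gsub("&amp;", " ", x, fixed = TRUE)   # drop the HTML-encoded ampersand
  x <- gsub("http[^[:space:]]*", "", x)      # remove links while their punctuation is still intact
  x <- gsub("\\bRT\\b", "", x, perl = TRUE)  # remove "RT" only as a whole word
  x <- gsub("@\\w+", "", x)                  # remove @mentions
  x <- gsub("[[:punct:]]", "", x)            # remove punctuation
  x <- gsub("[[:digit:]]", "", x)            # remove digits
  x <- gsub("[[:space:]]+", " ", x)          # collapse spaces, tabs and newlines to one space
  x <- trimws(x)                             # strip leading and trailing whitespace
  tolower(x)
}

sample_tweets <- c(
  "RT @SirJadeja: #MI Won The Toss &amp; Elected To Bowl First. Who Will Win?\n\n#DDvMI #MIvDD #IPL2016 #IPL",
  "Start the party https://t.co/abc123 RT"   # made-up tweet with a made-up link
)
cleaned <- clean_text_v2(sample_tweets)
print(cleaned)
```

Collapsing whitespace to a single space, instead of deleting it, keeps neighbouring words apart, and matching "RT" as a whole word leaves words such as "Start" intact. This variant calls tolower() directly; if your tweets may contain invalid byte sequences, keep the tryCatch() wrapper from the original function.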

In the next tutorial, you can learn how to create a word cloud from Twitter data using R.
