First steps in Machine learning. Part one. Preparing data

Some time ago, Google presented their deep learning library called TensorFlow. The whole world has been playing with the DeepDream experiment, and I was no exception.

Google also made a Deep Learning course on Udacity. I wasn’t familiar with any of this complex material, so a friend recommended that I take Intro to Machine Learning first. This weekend I completed three lessons; they were interesting and not too complex, except maybe for the Python language, which I had never used before.

I found that I could play a bit with Naive Bayes, which was the topic of the first lesson and assignment. In short, it is one of the simplest machine learning algorithms and is well suited to analysing simple text data.
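To give an idea of how little code this takes, here is a minimal sketch with scikit-learn, the library the Udacity course is built around. The example phrases and labels are invented purely for illustration, and I use MultinomialNB, the Naive Bayes variant usually applied to word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up example data: 1 = betrayal, 0 = win
texts = ["everything is lost again", "we finally did it", "another epic win"]
labels = [1, 0, 0]

# Turn raw text into word-count features
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)

# Train a Naive Bayes classifier and classify a new phrase
clf = MultinomialNB()
clf.fit(features, labels)
print(clf.predict(vectorizer.transform(["what a win"])))
```

The real classifier will, of course, be trained on the cleaned tweets described below.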

After the revolution, Ukrainian Twitter developed an interesting convention. Loads of tweets are marked with the hashtags #зрада and #перемога, which mean #betray and #win respectively. It’s hard to explain, but they work just like labels: when people don’t like what happens, they say “that is betrayal!”, or the opposite, “it’s an epic win!”

This, in my opinion, fits my task perfectly.

I decided to collect tweets marked with #win and #betray, extract the labels, clean the data, train my model, and then classify tweets even without hashtags.

The first challenge was to find data.

It turned out that all deep Twitter archives are paid. I found a nice free script to archive tweets. The idea first came to me in December 2015, so I had about three months to collect data. Sad, but true.
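I won’t reproduce that script here, but collecting such tweets yourself could look roughly like the sketch below, using the tweepy library (v4 API). The credentials, file name and query are placeholders, not what I actually used.

```python
import csv
import tweepy

# Placeholder credentials - fill in your own Twitter API keys
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search recent tweets with either hashtag and dump the text to CSV
with open("raw_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for tweet in tweepy.Cursor(api.search_tweets,
                               q="#зрада OR #перемога",
                               lang="uk",
                               tweet_mode="extended").items(1000):
        writer.writerow([tweet.full_text])
```

Note that the standard search endpoint only returns tweets from roughly the last week, which is exactly why the collection has to keep running for months.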

Once I had collected the data, what should I do with it?

Clean it. To be honest, I don’t yet know how to do this the right way, so it took me a few hours (a scripted version of the same steps is sketched right after this list).

  • At first, I extracted only the tweet text by removing the irrelevant columns.
  • Then I found that Numbers on macOS can’t search with regular expressions, so I copied the data to Sublime Text and removed all hyperlinks with a regex search:
    (https?:\/\/(?:www\.|(?!www))[^\s\.]+\.[^\s]{2,}|www\.[^\s]+\.[^\s]{2,})
  • Made everything lowercase.
  • Deduplication. I can’t believe there is no easy way to deduplicate data in macOS Numbers. The solution: open a shell and run

       sort text_only_tweets.csv | uniq > uniq_tweets.csv

  • At this stage, I copied the data back to Numbers and wrote a couple of formulas to extract labels. I added one column and put 0 if a tweet contains #win and 1 if it contains #betray. If a tweet contains both tags, it is removed as not relevant.
  • Removed all #win and #betray words from the tweets themselves, since they give the label away too literally and would break the analysis.
  • Then all symbols like . , ? ! @ ^ * # etc. should be removed. Of course, it’s possible to use a regex, but I just replaced them one by one as I found them in the text.
  • Since I replaced every symbol with a space, I then replaced double spaces with single spaces several times until none remained.
  • The final stage of preparing the data: export the table to CSV, replace spaces with commas, and then copy it back to Excel.
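As promised above, here is a rough scripted equivalent of those manual steps. It is only a sketch: the file names are hypothetical, the symbol-stripping regex is deliberately crude, and it works on the original Ukrainian hashtags rather than the translated #win/#betray.

```python
import csv
import re

LINK_RE = re.compile(r"https?://\S+|www\.\S+")  # hyperlinks
SYMBOL_RE = re.compile(r"[^\w\s]")              # anything that is not a letter, digit or space

rows, seen = [], set()
with open("text_only_tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if not row:
            continue
        text = LINK_RE.sub(" ", row[0].lower())

        # Label: 0 if the tweet contains #перемога (win), 1 if #зрада (betrayal);
        # drop tweets that contain both or neither
        has_win, has_betrayal = "#перемога" in text, "#зрада" in text
        if has_win == has_betrayal:
            continue
        label = 0 if has_win else 1

        # Remove the hashtags themselves, strip symbols, collapse whitespace
        text = text.replace("#перемога", " ").replace("#зрада", " ")
        text = " ".join(SYMBOL_RE.sub(" ", text).split())

        if text and text not in seen:           # deduplicate
            seen.add(text)
            rows.append([text, label])

with open("uniq_labeled_tweets.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```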

As a result, I have a CSV file that looks like this. The last column is the label, 0 or 1.

(screenshot of the resulting CSV data)

Right after archiving the tweets I had about 15,000 lines; after removing duplicates, empty lines, links and other useless data, I was left with about 3,800 lines. We’ll see whether that is enough to at least pass the tests.

Next time I’ll train my model and measure its accuracy. Of course, I’ll write a post about it.