Building a Bot. Part One: Scraping a Website

Nowadays you often hear that bots are the future, so I decided to check it out and try building a Telegram bot. I'm going to make a bot that can find me a recipe for a given list of products, for example, the ones I have in my fridge.

First of all, I needed to find a good database of recipes, and that wasn't hard because I decided to simply scrape one of my favourite websites, eda.ru.

Filling the database is a one-time task, so I decided to keep it simple: if you just google 'java website scraper', you quickly find the best option, the Jsoup library.

Jsoup is quite simple and lets you use standard CSS selectors to navigate the DOM, so you are free to use #id, .class and all the other selectors.
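As a quick illustration of the selector API (parsing an inline HTML string here instead of fetching a live page; the snippet and its class names are made up for the example):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorDemo {
    public static void main(String[] args) {
        // A tiny hand-written HTML snippet, mimicking the structure we scrape later
        String html = "<div id='header'><span class='title'>Recipes</span></div>"
                + "<div class='b-recipe-widget__name'><a href='/recepty/123'>Borscht</a></div>";
        Document doc = Jsoup.parse(html);

        // #id and .class selectors, exactly like in CSS
        System.out.println(doc.select("#header .title").text());

        // class + attribute selector, the same kind used for the crawl below
        Elements links = doc.select(".b-recipe-widget__name a[href]");
        System.out.println(links.first().attr("href"));
    }
}
```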

Next, we need to implement models to structure the data.

I have three entities: Recipe, which holds a list of Ingredients and a list of Directions (the steps to cook the dish). They are really simple POJOs, which I won't describe in detail here.
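For reference, a minimal sketch of what these POJOs might look like; the field names and getters/setters below are my assumptions, since the actual classes aren't shown:

```java
import java.util.List;

// Hypothetical sketch of the article's POJOs; the real field names
// are not shown in the post, so these are assumptions.
class Ingredient {
    private final String name;
    private final String amount;

    Ingredient(String name, String amount) {
        this.name = name;
        this.amount = amount;
    }

    String getName() { return name; }
    String getAmount() { return amount; }
}

class Recipe {
    private String name;
    private List<Ingredient> ingredients;
    private List<String> directions; // steps to cook the dish

    String getName() { return name; }
    void setName(String name) { this.name = name; }
    List<Ingredient> getIngredients() { return ingredients; }
    void setIngredients(List<Ingredient> ingredients) { this.ingredients = ingredients; }
    List<String> getDirections() { return directions; }
    void setDirections(List<String> directions) { this.directions = directions; }
}
```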

Then we need to figure out the website's structure. In my case it's not that complex: recipes are listed at /recepty/page{N}, with 20 recipe headers on each page. So from each page we get 20 links, and we follow each of them to parse the full recipe.

// Note: Jsoup.connect(...).get() throws IOException and Thread.sleep()
// throws InterruptedException, so the enclosing method must declare
// them or wrap this block in a try/catch.
int page = 1;
while (true) {
    Thread.sleep(2000); // be polite: throttle requests to the site
    Document doc = Jsoup.connect("http://eda.ru/recepty/page" + page).get();
    Elements links = doc.select(".b-recipe-widget__name a[href]");
    if (links.isEmpty()) {
        break; // no recipes on this page, so we've reached the last one
    }
    List<String> urls = new LinkedList<>();
    for (Element link : links) {
        urls.add(link.attr("href"));
    }
    // Emit the page's URLs as one stream instead of one Observable per link
    Observable.from(urls).subscribe(new GrabAction());
    page++;
}

Here we have a loop over the pages (it breaks once it runs past the last one). On each page it collects the URLs of the recipe pages using the .b-recipe-widget__name a[href] selector, which means: take the href attribute of the link inside an element with the class b-recipe-widget__name. It then wraps the URLs in an rx.Observable and subscribes to it with a GrabAction, which parses all the information. GrabAction implements rx.functions.Action1&lt;String&gt; because it takes a single argument, the URL, and parses that page.


class GrabAction implements Action1<String> {

  @Override
  public void call(String s) {
    try {
      Document doc = Jsoup.connect("http://eda.ru" + s).get();
      Recipe recipe = new Recipe();
      recipe.setName(doc.select("h1").first().text());

      List<Ingredient> ingredients = new ArrayList<>();
      for (Element row : doc.select("tr.ingredient")) {
        String name = row.select("td .name").text();
        String amount = row.select("td .amount").text();
        ingredients.add(new Ingredient(name, amount));
      }
      recipe.setIngredients(ingredients);

      MongoWriter.getInstance().writeOne(recipe);
    } catch (IOException e) {
      // call() can't throw checked exceptions, so handle the failure here
      System.err.println("Failed to grab " + s + ": " + e.getMessage());
    }
  }
}

This is the part of GrabAction that illustrates how to parse values from the website and store them in an object.
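The MongoWriter used above isn't shown in the post. As a rough sketch of how such a singleton could look, here is a minimal version with the actual database write stubbed out by an in-memory list; the real class would hold a MongoCollection and call the MongoDB driver's insertOne instead:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the MongoWriter singleton used in GrabAction.
// The real implementation would open a MongoClient and call
// collection.insertOne(...); here the write is stubbed with a list
// so only the shape of the class is illustrated.
class MongoWriter {
    private static final MongoWriter INSTANCE = new MongoWriter();

    private final List<Object> written = new ArrayList<>();

    private MongoWriter() {
        // the real constructor would set up the MongoDB connection here
    }

    static MongoWriter getInstance() {
        return INSTANCE;
    }

    void writeOne(Object recipe) {
        // real version: collection.insertOne(toBsonDocument(recipe));
        written.add(recipe);
    }

    int writtenCount() {
        return written.size();
    }
}
```

The eager static-field singleton is thread safe without locking, which matters if the Observable ever delivers URLs on multiple threads.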

I store everything in MongoDB, and next time I'll write about the codecs used to serialize and deserialize objects to BSON.