Building a Bot. Part One: Scraping a Website

You hear a lot about bots these days and that they are the future, so I decided to check them out and try to build a Telegram bot. I’m going to make a bot that can find me a recipe from a given list of products, for example the ones I have in my fridge.

First of all, I needed to find a good database of recipes. That wasn’t hard, because I decided to just scrape it from one of my favourite websites.

Filling the database is a one-time task, so I decided to keep it simple. If you google ‘java website scraper’, you quickly find the option I went with: the Jsoup library.

Jsoup is quite simple and lets you use standard CSS selectors to navigate through the DOM, so you are free to use #id, .class and all other selectors.

First of all, we need to implement models to structure data.

I have three entities: a Recipe, which has an array of Ingredients and an array of Directions (the steps to cook a dish). I made really simple POJOs, which I won’t describe here.
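For readers who want something concrete, here is a minimal sketch of what such POJOs might look like. Only the class names and the two-argument Ingredient constructor come from this post; every field and method name below is an assumption made for illustration, not the actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// One row of the ingredients table: a product name and its quantity text.
final class Ingredient {
    private final String name;
    private final String amount;

    public Ingredient(String name, String amount) {
        this.name = name;
        this.amount = amount;
    }

    public String getName() { return name; }
    public String getAmount() { return amount; }
}

// One cooking step; the step number keeps the original order (assumed field).
final class Direction {
    private final int step;
    private final String text;

    public Direction(int step, String text) {
        this.step = step;
        this.text = text;
    }

    public int getStep() { return step; }
    public String getText() { return text; }
}

// A recipe aggregates its ingredients and its directions.
final class Recipe {
    private String name;
    private final List<Ingredient> ingredients = new ArrayList<>();
    private final List<Direction> directions = new ArrayList<>();

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public void addIngredient(Ingredient ingredient) { ingredients.add(ingredient); }
    public void addDirection(Direction direction) { directions.add(direction); }

    // Read-only views, so callers cannot modify the lists behind the recipe's back.
    public List<Ingredient> getIngredients() { return Collections.unmodifiableList(ingredients); }
    public List<Direction> getDirections() { return Collections.unmodifiableList(directions); }
}

final class RecipeDemo {
    public static void main(String[] args) {
        Recipe recipe = new Recipe();
        recipe.setName("Omelette");
        recipe.addIngredient(new Ingredient("egg", "2 pcs"));
        recipe.addDirection(new Direction(1, "Beat the eggs."));
        System.out.println(recipe.getName() + " has "
                + recipe.getIngredients().size() + " ingredient(s)");
    }
}
```

Keeping the amount as plain text avoids parsing units at scraping time; that can be done later if the bot needs it.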

Then we need to figure out the website structure. In my case it is not so complex: there are /recipe/page{N} listing pages with 20 recipe headers on each of them. Once we have the 20 links from a page, we follow each of them and parse the entire recipe.

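int page = 1;
while (true) {
    // The listing URL is built by appending the page number to the site address.
    Document doc = Jsoup.connect("" + page).get();
    Elements links = doc.select(".b-recipe-widget__name a[href]");
    List<String> urls = new LinkedList<>();
    for (Element link : links) {
        urls.add(link.attr("href")); // collect the URL of every recipe on this page
    }
    Observable<String> recipeUri = Observable.from(urls);
    recipeUri.subscribe(new GrabAction());
    page++;
}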

Here we have an infinite loop through the pages (I just let it break after reaching the last page). It takes the URLs of the recipe pages using the .b-recipe-widget__name a[href] selector, which matches every link that has an href attribute and sits inside an element with the class b-recipe-widget__name; the URL itself is then read from that href attribute. Then it builds an rx.Observable from the URLs and subscribes to it with GrabAction, which parses all the information. GrabAction implements rx.functions.Action1<String> because it takes only one argument, the URL, and parses that page.
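How the loop ends is not shown in this post. As a minimal sketch of one possible stop condition, assuming that a page past the last one simply yields no recipe links, the page walk could look like this. PageWalker, collectLinks, and the stubbed link paths are hypothetical names used only for illustration; the real fetch would be the Jsoup call from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

final class PageWalker {

    // Walks pages 1, 2, 3, ... and gathers links until a page comes back empty.
    // fetchLinks stands in for "download page N and select the recipe links".
    static List<String> collectLinks(IntFunction<List<String>> fetchLinks) {
        List<String> all = new ArrayList<>();
        for (int page = 1; ; page++) {
            List<String> links = fetchLinks.apply(page);
            if (links.isEmpty()) {
                break; // assumed end-of-listing signal: nothing left to scrape
            }
            all.addAll(links);
        }
        return all;
    }

    public static void main(String[] args) {
        // Stub in place of the real site: 3 pages with 20 links each.
        IntFunction<List<String>> stub = page -> {
            List<String> links = new ArrayList<>();
            if (page <= 3) {
                for (int i = 1; i <= 20; i++) {
                    links.add("/recipe/" + ((page - 1) * 20 + i)); // made-up paths
                }
            }
            return links;
        };
        System.out.println(collectLinks(stub).size()); // prints 60
    }
}
```

Passing the fetch step in as a function keeps the loop testable without any network access.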

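class GrabAction implements Action1<String> {

  @Override
  public void call(String s) {
    try {
      Document doc = Jsoup.connect("" + s).get();
      Recipe recipe = new Recipe();
      String title = doc.select("h1").first().text(); // recipe title, to be stored on the recipe
      List<Ingredient> ingredients = new ArrayList<>();
      Elements ingr = doc.select("tr.ingredient");

      for (Element i : ingr) {
        String val = i.select("td .name").text();  // ingredient name
        String q = i.select("td .amount").text();  // quantity
        ingredients.add(new Ingredient(val, q));
      }
      // ... the rest of the recipe is parsed in the same way
    } catch (IOException e) {
      throw new RuntimeException(e); // call() cannot throw checked exceptions
    }
  }
}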

This is the part of GrabAction that illustrates how to parse values from the website and store them in an object.

I store the recipes in Mongo, and next time I’ll write about the codecs you use to serialize and deserialize objects to and from BSON.