Microservices tutorial building a web-scraper with Java, Python and RabbitMQ (updated)

Summary: A sample web scraping service demonstrating how to build a message-driven application using RabbitMQ. The application consists of three parts: a front-end developed in Knockout.js that communicates with a Spring Boot Java API, which in turn offloads scraping tasks to a Python microservice.

Dear reader, I have created a course based on this tutorial. If you enjoy the tutorial, you’ll love the class.

Let's develop a message-driven microservices application

Learn how to build scalable applications using multiple frameworks and languages in one knowledge-packed crash course
  • Follow the complete development cycle when we go from idea to finished application.
  • Learn the essentials of single-page frontends (Knockout.js), REST based backends (Java-Spring) and microservices (Python, RabbitMQ) in one compact course.
  • Made for the busy developer. Ships with virtual machine image and tutor app so you can focus 100% on honing your coding skills.

In this post, I’m going to show how to implement a message-driven application that consists of three independent parts. The application is a web scraping service that takes a URL and returns a text summary of the site.

This is an example that shows one advantage of the microservices1 pattern: it is easy to implement an independent part of an application in the programming language that is most suitable for the task at hand. It so happens that a good website-summarization library exists for Python, which we are going to use to extract a short summary of a website. The main backend functionality, however, is implemented in Java using Spring Boot (personally, I prefer using the JVM as the core of any application).

Tools and frameworks I use are:

  • Spring Boot: REST, JPA, AMQP
  • Gradle
  • Java 8
  • Python with Pika and sumy
  • RabbitMQ
  • Postgres
  • Vagrant
  • Knockout.js, Bootstrap, Node

This project enhances a previous project that implemented a bookmark web app, Rapid prototyping with Spring Data Rest and Knockout.js, by adding scraping functionality to the bookmark service. Because of this, I won’t repeat what I have already explained there (mainly the REST API, JPA persistence, CORS and the details of the Knockout.js front-end). If interested, have a look at that post.

As usual, the project is available on Github2.

Installation & run

Using vagrant

I’m using Vagrant to install Postgres and RabbitMQ on a virtual machine. All you need to do is run vagrant up (this requires VirtualBox to be installed and some hard drive space for the image).

Note: if you want to get rid of the vagrant virtual machine, just run vagrant destroy.

Note: to make things easier, the line spring.jpa.hibernate.ddl-auto=create in application.properties automatically creates the tables on startup. However, this also means all tables get dropped and recreated with each new start. Change the line to spring.jpa.hibernate.ddl-auto=validate (after you have started the backend at least once) to avoid data loss.

Running the application

To run the application, clone/fork the repository.

The project has three parts that are independently deployed and executed:

scraping-microservice-java-python-rabbitmq
├── java-api-backend
├── knockout-frontend
└── python-scraping-service
Vagrantfile

Start vagrant, and then each service separately:

$ cd scraping-microservice-java-python-rabbitmq

scraping-microservice-java-python-rabbitmq$ vagrant up    
scraping-microservice-java-python-rabbitmq$ cd java-api-backend

java-api-backend$ gradle run
java-api-backend$ cd ../knockout-frontend

knockout-frontend$ python -m SimpleHTTPServer 8090
knockout-frontend$ cd ../python-scraping-service

python-scraping-service$ pip install -r requirements.txt
python-scraping-service$ python worker.py

Note: vagrant up takes a while the first time it’s executed. pip install -r requirements.txt is preferably done in a virtualenv3.

Also, for this post I’m using Python 2.

Application architecture

The following picture depicts how the main parts work together:

Webscraper architecture

RabbitMQ in 5 Minutes

RabbitMQ is an open source “message broker” software that implements the Advanced Message Queuing Protocol (AMQP)4. The basic elements of AMQP are (quoting from RabbitMQ’s official documentation5):

  • The producer is the sender of a message.
  • The consumer is the receiver of a message. A consumer mostly waits to receive messages.
  • A queue is the name for a mailbox. It lives inside RabbitMQ. Although messages flow through RabbitMQ and your applications, they can be stored only inside a queue. A queue is not bound by any limits, it can store as many messages as you like ‒ it’s essentially an infinite buffer. Many producers can send messages that go to one queue, many consumers can try to receive data from one queue.
  • An exchange. The core idea in the messaging model in RabbitMQ is that the producer never sends any messages directly to a queue. Instead, the producer can only send messages to an exchange. It is a very simple thing: on one side it receives messages from producers, and on the other side it pushes them to queues. How it does this is defined by the exchange type. There are a few available: direct, topic, headers and fanout.
  • A binding tells the exchange to which queue it should send messages. Bindings can take an extra routing key parameter, the binding key. The meaning of a binding key depends on the exchange type. The fanout exchange, for example, simply ignores its value. With a direct exchange, a message goes to the queues whose binding key exactly matches the routing key of the message.
  • Publishing a message in RabbitMQ therefore takes a routing key that an exchange uses to match corresponding queues.

Our application is going to use one special exchange that simplifies message routing: the default exchange. This is a nameless direct exchange that automatically routes a message to the queue whose name matches the message's routing key.

Spring Boot Java API backend

We start with the main API that the front-end communicates with and that is responsible for storing bookmarks and sending new tasks to the scraping service.

(As mentioned, I’m only going to explain how to implement RabbitMQ messaging in Spring. If you are interested in learning about the REST API and database (JPA) persistence part, I have blogged about how to rapidly develop a REST API using Spring Data.)

Spring has a Boot starter module available for RabbitMQ6, org.springframework.boot:spring-boot-starter-amqp, which we add to our dependencies.

Spring Java-based configuration: main class

As usual for any Spring Boot application, most of the configuration is set up automatically, depending on which dependencies are on the classpath and which beans are found.

Every Spring Boot application starts with the @SpringBootApplication annotation, a convenience annotation that adds all of the following7:

  • @Configuration tags the class as a source of bean definitions for the application context.
  • @EnableAutoConfiguration tells Spring Boot to start adding beans based on classpath settings, other beans, and various property settings.
  • @ComponentScan tells Spring to look for other components, configurations, and services in the same and sub-packages (because of that, it is recommended to put the main application class in the root package).
  • Depending on project dependency and bean definitions, further annotations will be added (e.g. @EnableWebMvc for a Spring MVC app when spring-webmvc is on the classpath).

The starting point of the backend API is, therefore, the base class with our main method:

src/main/java/scraper/api/ScraperApiApplication.java
@SpringBootApplication
public class ScraperApiApplication
{

    public static void main(String[] args)
    {
        SpringApplication.run(ScraperApiApplication.class, args);
    }

    // ... more bean definitions 
}

From there, Spring searches for further @Configuration annotated classes.

(By the way, there is no XML configuration to be found whatsoever in this project. Spring Boot prefers Java-based configuration8, and so do I, even though it is possible to mix the two.)

Task producer

As shown in the architecture picture above, the API backend is a producer of tasks that the scraper consumes. Let’s start with the producer configuration. It extends RabbitMqConfiguration, which we are going to re-use later for the consumer part:

src/main/java/scraper/api/amqp/TaskProducerConfiguration.java
@Configuration
public class TaskProducerConfiguration extends RabbitMqConfiguration
{
    protected final String tasksQueue = "tasks.queue";

    @Bean
    public RabbitTemplate rabbitTemplate()
    {
        RabbitTemplate template = new RabbitTemplate(connectionFactory());
        template.setRoutingKey(this.tasksQueue);
        template.setQueue(this.tasksQueue);
        template.setMessageConverter(jsonMessageConverter());
        return template;
    }

    @Bean
    public Queue tasksQueue()
    {
        return new Queue(this.tasksQueue);
    }
}

src/main/java/scraper/api/amqp/RabbitMqConfiguration.java
@Configuration
public class RabbitMqConfiguration
{
    @Bean
    public ConnectionFactory connectionFactory()
    {
        CachingConnectionFactory connectionFactory = new CachingConnectionFactory("192.168.22.10");
        connectionFactory.setUsername("user");
        connectionFactory.setPassword("password");
        return connectionFactory;
    }

    @Bean
    public AmqpAdmin amqpAdmin()
    {
        return new RabbitAdmin(connectionFactory());
    }


    @Bean
    public MessageConverter jsonMessageConverter()
    {
        return new Jackson2JsonMessageConverter();
    }
}

We also define a MessageConverter bean that returns a JSON-based converter. With this, the messages we send to the queues are automatically converted to JSON. This allows us to exchange simple Java “POJOs” (serialized into JSON) between our services.
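
The TaskMessage class itself is not listed in this post. Since the Python worker below only reads a url field from the JSON body, a minimal sketch of such a POJO could look like the following (an illustration; the class in the repository may carry additional fields):

TaskMessage.java (sketch)
public class TaskMessage
{
    private String url;

    // Jackson needs a no-arg constructor to deserialize JSON into this class
    public TaskMessage()
    {
    }

    public TaskMessage(String url)
    {
        this.url = url;
    }

    public String getUrl()
    {
        return url;
    }

    public void setUrl(String url)
    {
        this.url = url;
    }
}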

The actual sending of messages is done with the configured RabbitTemplate. We implement a simple class that does the sending:

src/main/java/scraper/api/amqp/TaskProducer.java
@Component
public class TaskProducer
{
    @Autowired
    private TaskProducerConfiguration taskProducerConfiguration;

    public void sendNewTask(TaskMessage taskMessage)
    {
        taskProducerConfiguration.rabbitTemplate()
                .convertAndSend(taskProducerConfiguration.tasksQueue, taskMessage);
    }

}

Now you might be wondering:

So what about exchanges, queues, bindings??

It’s all been defined already, mostly implicitly. We don’t explicitly define an exchange, so by default we use the default direct exchange (see above). The same goes for the binding: since we don’t declare one, every queue is automatically bound to the default exchange under its own name. We only define the queue. In convertAndSend(taskProducerConfiguration.tasksQueue, taskMessage) we use the name of the queue as the routing key, so the default exchange delivers the message to the queue of the same name. That’s all we have to configure to make messaging work.
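
Where is sendNewTask actually invoked? That wiring is not shown in this post. One plausible way to do it (a sketch for illustration, not necessarily how the repository wires it) is a Spring Data REST event handler that fires after a new bookmark has been persisted:

BookmarkEventHandler.java (hypothetical wiring, for illustration)
@Component
@RepositoryEventHandler(Bookmark.class)
public class BookmarkEventHandler
{
    @Autowired
    private TaskProducer taskProducer;

    // Called by Spring Data REST after a new Bookmark entity has been created
    @HandleAfterCreate
    public void handleAfterCreate(Bookmark bookmark)
    {
        taskProducer.sendNewTask(new TaskMessage(bookmark.getUrl()));
    }
}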

ScrapingResult consumer

Looking back at the architecture picture, we see that the API is also a consumer of the ScrapingResult queue. Again, we base the consumer configuration on RabbitMqConfiguration:

src/main/java/scraper/api/amqp/ScrapingResultConsumerConfiguration.java
@Configuration
public class ScrapingResultConsumerConfiguration extends RabbitMqConfiguration
{
    protected final String scrapingResultQueue = "scrapingresult.queue";

    @Autowired
    private ScrapingResultHandler scrapingResultHandler;

    @Bean
    public RabbitTemplate rabbitTemplate() {
        RabbitTemplate template = new RabbitTemplate(connectionFactory());
        template.setRoutingKey(this.scrapingResultQueue);
        template.setQueue(this.scrapingResultQueue);
        template.setMessageConverter(jsonMessageConverter());
        return template;
    }

    @Bean
    public Queue scrapingResultQueue() {
        return new Queue(this.scrapingResultQueue);
    }

    @Bean
    public SimpleMessageListenerContainer listenerContainer() {
        SimpleMessageListenerContainer container = new SimpleMessageListenerContainer();
        container.setConnectionFactory(connectionFactory());
        container.setQueueNames(this.scrapingResultQueue);
        container.setMessageListener(messageListenerAdapter());
        return container;
    }

    @Bean
    public MessageListenerAdapter messageListenerAdapter() {
        return new MessageListenerAdapter(scrapingResultHandler, jsonMessageConverter());
    }
}

The template and queue definitions are almost the same, except for a different queue name. A consumer does not actively send anything; it waits to receive a message. Therefore, we define a message listener. To do that in Spring, we declare a MessageListenerAdapter bean and add it to a SimpleMessageListenerContainer. The actual listener implementation is again a simple Java POJO class:

src/main/java/scraper/api/amqp/ScrapingResultHandler.java
@Component
public class ScrapingResultHandler
{
    @Autowired
    private BookmarkRepository bookmarkRepository;

    public void handleMessage(ScrapingResultMessage scrapingResultMessage)
    {
        System.out.println("Received summary: " + scrapingResultMessage.getSummary());
        final String url = scrapingResultMessage.getUrl();
        final List<Bookmark> bookmarks = bookmarkRepository.findByUrl(url);
        if (bookmarks.isEmpty())
        {
            System.out.println("No bookmark of url: " + url + " found.");
        }
        else
        {
            for (Bookmark bookmark : bookmarks)
            {
                bookmark.setSummary(scrapingResultMessage.getSummary());
                bookmarkRepository.save(bookmark);
                System.out.println("updated bookmark: " + url);
            }
        }
    }
}

In that listener, we get the summary from the message and update the bookmarks for that URL.
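
The repository used above is a plain Spring Data repository with a derived findByUrl query. A minimal sketch (assuming a Long id; the actual repository comes from the previous bookmark post and may differ, e.g. by extending PagingAndSortingRepository):

BookmarkRepository.java (sketch)
public interface BookmarkRepository extends CrudRepository<Bookmark, Long>
{
    // Derived query: Spring Data generates the implementation from the method name
    List<Bookmark> findByUrl(String url);
}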

JSON message converter type

One question remains: how does the JSON message converter know that it has to deserialize a message it receives from RabbitMQ into scraper.api.amqp.ScrapingResultMessage? Normally, this is accomplished by setting a header on the message: __TypeId__ = "scraper.api.amqp.ScrapingResultMessage". This is fine if only Java services communicate with each other. But in our case we want completely independent parts: the Python scraper should not know anything about how the Java API backend consumes messages. For this, we override the default behavior by providing a ClassMapper in the RabbitMQ configuration:

src/main/java/scraper/api/amqp/RabbitMqConfiguration.java
    @Bean
    public MessageConverter jsonMessageConverter()
    {
        final Jackson2JsonMessageConverter converter = new Jackson2JsonMessageConverter();
        converter.setClassMapper(classMapper());
        return converter;
    }

    @Bean
    public DefaultClassMapper classMapper()
    {
        DefaultClassMapper typeMapper = new DefaultClassMapper();
        typeMapper.setDefaultType(ScrapingResultMessage.class);
        return typeMapper;
    }

With typeMapper.setDefaultType(ScrapingResultMessage.class); we tell Spring that all messages we expect to consume are of type scraper.api.amqp.ScrapingResultMessage. We set this for the whole application because that is all we consume. Later, we might want to move it into a specific consumer configuration (e.g. ScrapingResultConsumerConfiguration).
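
For completeness, ScrapingResultMessage only needs the two fields that the handler and the Python worker actually use: url and summary. A minimal sketch:

src/main/java/scraper/api/amqp/ScrapingResultMessage.java (sketch)
public class ScrapingResultMessage
{
    private String url;
    private String summary;

    // Jackson needs a no-arg constructor to deserialize JSON into this class
    public ScrapingResultMessage()
    {
    }

    public String getUrl() { return url; }

    public void setUrl(String url) { this.url = url; }

    public String getSummary() { return summary; }

    public void setSummary(String summary) { this.summary = summary; }
}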

Testing the API backend

That’s all for the Java API part. cd into the project and run gradle run. The API connects to RabbitMQ and Postgres. Test the API by sending JSON requests to it (as explained in the previous post). Adding a bookmark will also send a new message to RabbitMQ.

java-api-backend$ http :8080/
{
    "_embedded": {}, 
    "_links": {
        "bookmarks": {
            "href": "http://localhost:8080/bookmarks", 
            "templated": false
        }, 
        "profile": {
            "href": "http://localhost:8080/alps", 
            "templated": false
        }
    }
}

java-api-backend$ http POST :8080/bookmarks url=www.yahoo.com
HTTP/1.1 201 Created

java-api-backend$ http :8080/bookmarks
{
    "_embedded": {
        "bookmarks": [
            {
                "_embedded": {}, 
                "_links": {
                    "self": {
                        "href": "http://localhost:8080/bookmarks/1", 
                        "templated": false
                    }
                }, 
                "created": "2015-08-06T14:00:08.574+0000", 
                "summary": null, 
                "url": "www.yahoo.com"
            }
    ...
}

The RabbitMQ web admin interface is available at http://192.168.22.10:15672/. Log in with “root” / “root”, and under “Queues” > “Get messages” you can inspect the new message:

Message in RabbitMQ

(note the __TypeId__ header, see above).

We can also test the consumer. Create a message in the RabbitMQ web admin: select the queue, choose “Publish message” and enter the following:

Publish message in RabbitMQ

The log of the backend will show something like this:

Received summary: a search engine
...
Hibernate: update bookmark set created=?, summary=?, url=? where id=?
updated bookmark: www.yahoo.com

Using the REST API:

$ http :8080/bookmarks
{
    "_embedded": {
        "bookmarks": [
            {
                "_embedded": {}, 
                "_links": {
                    "self": {
                        "href": "http://localhost:8080/bookmarks/1", 
                        "templated": false
                    }
                }, 
                "created": "2015-08-07T08:41:45.458+0000", 
                "summary": "a search engine", 
                "url": "www.yahoo.com"
            }
        ]
    }, 
    "_links": {
        "search": {
            "href": "http://localhost:8080/bookmarks/search", 
            "templated": false
        }
    }
}

Now let’s get polyglot and continue with the Python scraper service.

Python scraper

The RabbitMQ part is simpler to implement in Python. The RabbitMQ home page has a nice set of tutorials if you want to learn more. We use Pika as our binding to RabbitMQ.

worker.py
import pika
import json
from scraper import Scraper

credentials = pika.PlainCredentials("user", "password")
parameters = pika.ConnectionParameters(host='192.168.22.10', credentials=credentials)

connection = pika.BlockingConnection(parameters)
channel = connection.channel()
tasks_queue = channel.queue_declare(queue='tasks.queue', durable=True)
scraping_result_queue = channel.queue_declare(queue='scrapingresult.queue', durable=True)

print ' [*] Waiting for tasks. To exit press CTRL+C'

# Publish the scraping result as JSON to the result queue (via the default exchange)
def publish_result(scraping_result):
    j = json.dumps(scraping_result.__dict__)
    properties = pika.BasicProperties(content_type="application/json")
    channel.basic_publish(exchange='', routing_key='scrapingresult.queue', body=j, properties=properties)

# Called for every task message: scrape the URL and publish the summary back
def callback(ch, method, properties, body):
    url = json.loads(body)['url']
    scraper = Scraper()
    result = scraper.scrape(url)
    publish_result(result)

channel.basic_consume(callback, queue='tasks.queue', no_ack=True)
channel.start_consuming()

We declare the two queues and wait for the tasks queue to deliver a message to consume. The callback method receives the JSON body, which we deserialize using the json package. We pass the url to our scraper and publish the result as a message on the scrapingresult.queue queue.

For the actual scraping of a website’s summary, we use the sumy library (https://github.com/miso-belica/sumy). Please see its docs (especially since it requires nltk to be installed on your system).

scraper.py
# Imports follow the sumy documentation (LSA summarizer); adjust if you use a different summarizer
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.utils import get_stop_words


class ScrapingResult:
    def __init__(self):
        self.url = None
        self.summary = None

LANGUAGE = "english"
SENTENCES_COUNT = 2

class Scraper:

    def scrape(self, url):
        complete_url = url
        try:
            # get summary
            print "Retrieving page summary of %s... " % url

            parser = HtmlParser.from_url(complete_url, Tokenizer(LANGUAGE))
            stemmer = Stemmer(LANGUAGE)

            summarizer = Summarizer(stemmer)
            summarizer.stop_words = get_stop_words(LANGUAGE)

            url_summary = ''.join(str(sentence) for sentence in summarizer(parser.document, SENTENCES_COUNT))

        except Exception, e:
            url_summary = "Could not scrape summary. Reason: %s" % e.message

        print "Done: %s = %s" % (url, url_summary)

        # create scraping result
        scraping_result = ScrapingResult()

        scraping_result.summary = url_summary
        scraping_result.url = url

        return scraping_result

The scraper is not doing much except using the sumy library and returning the result.

To run the Python scraper, install the dependencies from requirements.txt (preferably in a virtualenv) and then simply run worker.py:

(venv)python-scraping-service$ python worker.py 

Frontend

For the frontend we re-use the Knockout.js project from the previous post and basically just add a new field for the summary part.

To start the frontend, we use Python’s SimpleHTTPServer. We need to choose another port, as 8080 is already taken by the backend API.

knockout-frontend$  python -m SimpleHTTPServer 8090

The bookmark microservice in action

Now we can add bookmarks and observe the processing in the logs:

Adding a bookmark on the web site

Backend log:

Hibernate: select nextval ('hibernate_sequence')
Hibernate: insert into bookmark (created, note, summary, url, id) values (?, ?, ?, ?, ?)

That sends a message to the queue that gets picked up and processed by the scraper:

Retrieving page summary of http://www.clojure.org ... 
Done: http://www.clojure.org  = It is designed to be a general-purpose language, ...

On the backend, the listener receives the result message and updates the bookmark:

Hibernate: select bookmark0_.id as id1_0_, bookmark0_.created as created2_0_, bookmark0_.note as note3_0_, bookmark0_.summary as summary4_0_, bookmark0_.url as url5_0_ from bookmark bookmark0_
Received summary: It is designed to be a general-purpose language....
...
Hibernate: update bookmark set created=?, note=?, summary=?, url=? where id=?
updated bookmark: http://www.clojure.org 

On the frontend we refresh the website and get:

Updated bookmark on the web site

(In a real-world application, we would push the updated info to the client, using WebSockets for example. For our prototype we skip this, as this post has already become quite long, and require the user to hit refresh.)

Source code on Github

Check out the source code on Github.

The video course

Let's develop a message-driven microservices application

Learn how to build scalable applications using multiple frameworks and languages in one knowledge-packed crash course
  • Follow the complete development cycle when we go from idea to finished application.
  • Learn the essentials of single-page frontends (Knockout.js), REST based backends (Java-Spring) and microservices (Python, RabbitMQ) in one compact course.
  • Made for the busy developer. Ships with virtual machine image and tutor app so you can focus 100% on honing your coding skills.

Thanks for reading

I hope you enjoyed reading this post. If you have comments, questions or found a bug please let me know, either in the comments below or contact me directly.

Resources
