Clipping Service News OCR

Introduction

Summary

Clipping Service, a leader in media monitoring and the owner of the largest Brazilian media data set, was running into scalability problems as its data center approached maximum capacity.

Clipping Service operates at a huge scale, receiving roughly 4.5K press pages per day from about 300 newspapers, in both digital and printed versions. Previously, employees called “readers” were responsible for reading and clipping (highlighting the targeted content), and the result was then passed on to the “reviewers” team.

As if the burden of reading countless pages a day were not enough, the readers’ shift starts around 4:30 a.m. with the “first reading” (i.e., the delivery of the morning papers).

Problem

For over 20 years this content has been ingested by so-called “readers”. But with the advent of the internet and the digital press boom at the end of the 90s, and more recently of social media, companies are shifting their clipping investments to monitoring other areas. Clipping Service therefore had to act to remain competitive in the market.

The plan was to automate news reading with OCR, NLP, and artificial intelligence to categorize media, achieving a higher throughput during ingestion and giving readers more time to review the content. This should also raise the quality of the content, since we as humans aren’t good at repetitive tasks, especially reading endless pages in search of names and words.
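To make the "searching for names and words" step concrete, here is a minimal sketch of how watched terms could be located in OCR output. It uses only the Python standard library; the function names (`normalize`, `find_mentions`) and the accent-stripping rule are assumptions for illustration, not the matching logic the project actually shipped.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip accents so 'São Paulo' matches 'sao paulo'."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.lower()


def find_mentions(page_text: str, watched_terms: list[str]) -> dict[str, list[int]]:
    """Map each watched term that occurs to its match offsets.

    Offsets refer to the normalized text, not the raw OCR output.
    """
    haystack = normalize(page_text)
    mentions: dict[str, list[int]] = {}
    for term in watched_terms:
        # Word boundaries keep 'vale' from matching inside 'equivalente'.
        pattern = re.compile(r"\b" + re.escape(normalize(term)) + r"\b")
        offsets = [match.start() for match in pattern.finditer(haystack)]
        if offsets:
            mentions[term] = offsets
    return mentions


if __name__ == "__main__":
    sample = "A Vale anunciou em São Paulo um valor equivalente."
    print(find_mentions(sample, ["vale", "sao paulo", "petrobras"]))
```

A real pipeline would likely add tokenization and handling of OCR noise on top of this, but the idea is the same: the machine does the exhaustive scan and the human reviews the hits.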

Solution

Tech implementation

After spending some time researching and benchmarking the alternatives at hand, we decided to use Python as the implementation language for handling text, OCR, and NLP (using NLTK), given its extensive libraries for NLP and image processing.

As the cloud provider we chose AWS for its stability and consistency compared with other vendors. The conclusion at the time was that the AWS price estimate was 14.67% higher than GCP’s. However, AWS was more popular than GCP and proven in terms of stability, support, and integrity, which made it the safer choice for a slightly higher price.

The tech stack was Python 3 with Dramatiq as the task processing library, running Tesseract OCR jobs and processing text with NLTK and images with Pillow (the maintained fork of PIL). Redis was the message broker for Dramatiq, a simple Postgres database stored execution metrics, and Elasticsearch stored the processed content.
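The pieces of that stack could fit together roughly as in the sketch below. It is an illustration under assumptions, not the project's code: the function names, the `por` language code, the grayscale preprocessing, the retry count, and the Redis URL are all hypothetical, and the Elasticsearch and Postgres writes are left out. Third-party imports are kept inside the functions so the module loads even where Tesseract, NLTK, or Dramatiq are not installed.

```python
from typing import Callable


def extract_text(image_path: str, lang: str = "por") -> str:
    """OCR one page image with Tesseract.

    Needs Pillow, pytesseract, and the tesseract binary with Portuguese data.
    """
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as page:
        # Grayscale is a common, cheap preprocessing step before OCR.
        return pytesseract.image_to_string(page.convert("L"), lang=lang)


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphabetic tokens with NLTK."""
    # wordpunct_tokenize is regex based, so no NLTK data download is needed.
    from nltk.tokenize import wordpunct_tokenize

    return [tok.lower() for tok in wordpunct_tokenize(text) if tok.isalpha()]


def process_page(
    image_path: str,
    ocr: Callable[[str], str] = extract_text,
    split: Callable[[str], list[str]] = tokenize,
) -> dict:
    """Run OCR, then tokenization; both steps are injectable for testing."""
    text = ocr(image_path)
    tokens = split(text)
    return {"path": image_path, "text": text, "token_count": len(tokens)}


def register_actor(redis_url: str = "redis://localhost:6379/0"):
    """Declare the page job as a Dramatiq actor backed by a Redis broker."""
    import dramatiq
    from dramatiq.brokers.redis import RedisBroker

    dramatiq.set_broker(RedisBroker(url=redis_url))

    @dramatiq.actor(max_retries=3)
    def ocr_page(image_path: str) -> None:
        result = process_page(image_path)
        # Hypothetical: indexing and metrics storage would happen here.
        print(result["path"], result["token_count"])

    return ocr_page
```

In a worker module one would call `ocr_page = register_actor()` at import time so the Dramatiq worker can discover the actor, and a producer would enqueue a page with `ocr_page.send(...)`. Passing the OCR and tokenizer steps as parameters is a design choice of this sketch that keeps the pipeline testable without the heavy dependencies.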

Requests coming from the data center reached an API Gateway, which invoked a Lambda function and delivered the resulting content.
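As a rough illustration of that request path, the sketch below shows a Lambda handler written for API Gateway's proxy integration, where the response is a dictionary with `statusCode`, `headers`, and a string `body`. Everything specific here is an assumption: the `page_id` path parameter, the `make_handler` factory, and the `load_result` lookup are hypothetical names, and the original text does not say how results were actually looked up.

```python
import json
from typing import Callable, Optional


def _response(status: int, payload: dict) -> dict:
    """Build a response in the shape API Gateway proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def make_handler(load_result: Callable[[str], Optional[dict]]):
    """Create a Lambda handler around a result lookup function.

    load_result is a hypothetical hook: it takes a page id and returns the
    processed content, or None when nothing is stored for that id.
    """

    def handler(event: dict, context: object) -> dict:
        # pathParameters is null in the event when the route has none.
        page_id = (event.get("pathParameters") or {}).get("page_id")
        if not page_id:
            return _response(400, {"error": "missing page_id"})
        result = load_result(page_id)
        if result is None:
            return _response(404, {"error": "page not found"})
        return _response(200, result)

    return handler


if __name__ == "__main__":
    store = {"p1": {"text": "ok"}}  # in-memory stand-in for real storage
    handler = make_handler(store.get)
    print(handler({"pathParameters": {"page_id": "p1"}}, None))
```

Injecting the lookup keeps the handler independent of where the content lives, so the same code can be exercised locally with a dictionary.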

The best part of the design? We stored and served the content via AWS S3. Each part was designed with fault tolerance, and we simply turned off the entire cloud infrastructure after the operation, turning it back on only the next day.

Operating only from 4 a.m. to 2 p.m., it was a “serverless” and ephemeral project that benefited from an aggressive reduction in cloud costs.
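The effect of that operating window is easy to quantify. The sketch below is illustrative only: the constant and function names are made up, and it says nothing about how the infrastructure was actually started and stopped. It checks whether a given moment falls inside the 4 a.m. to 2 p.m. window and computes the share of the day the stack is up.

```python
from datetime import datetime, time

# Operating window taken from the text: 4 a.m. to 2 p.m.
WINDOW_START = time(4, 0)
WINDOW_END = time(14, 0)


def should_be_running(now: datetime) -> bool:
    """True while the infrastructure is meant to be up (end is exclusive)."""
    return WINDOW_START <= now.time() < WINDOW_END


def uptime_fraction() -> float:
    """Share of a 24-hour day covered by the operating window."""
    start_minutes = WINDOW_START.hour * 60 + WINDOW_START.minute
    end_minutes = WINDOW_END.hour * 60 + WINDOW_END.minute
    return (end_minutes - start_minutes) / (24 * 60)


if __name__ == "__main__":
    print(should_be_running(datetime.now()))
    print(f"{uptime_fraction():.1%} of the day")
```

The window covers 10 of 24 hours, so resources billed by running time are up for about 42% of what an always-on deployment would use.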

Impact and results

Clipping Service reduced its reading team workforce by ~78%, offering internal hiring for other areas of the company and a voluntary dismissal plan with benefits, making the process as humane as possible for the former employees.

By automating reading tasks, Clipping Service achieved considerable improvements in press ingestion throughput (around 20 times faster) and could offer a higher-quality press clipping service to its customers. Since the operational cost decreased significantly, it also saw the opportunity to create a self-service press clipping product later.

Matheus Cunha
Systems Engineer and Magician

Just a technology lover empowering business with high-tech computing to help innovation (:
