Hey Karthikeyan, BeautifulSoup is a library that “parses” HTML or XML content. In other words, it reads your HTML file and helps you extract content from it. Scrapy is a full-blown web scraping framework. That means it already has the functionality that BeautifulSoup provides, and it offers much more on top of that.

When you are developing a web scraping system, you need a way to send requests to websites (probably using requests or urllib) and a way to send multiple requests at once (multiprocessing or asynchronous I/O) so that you can download content faster. You also need a way to export the downloaded content in whatever formats are required. If you are working on large-scale projects, you also need to deploy your scraping code across distributed systems.
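To make those pieces concrete, here is a rough sketch of a hand-rolled scraper that uses only the Python standard library. It is an illustration under assumptions, not how Scrapy works internally: the names `fetch`, `extract_title`, `scrape` and `to_csv` are made up for this example, a thread pool stands in for the "multiple requests at once" part, and the standard library's `html.parser` stands in for BeautifulSoup so the snippet has no third-party dependencies.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Piece 1: send a request and return the page body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html):
    """Piece 2: parse the HTML and pull out one field (the page title)."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def scrape(urls, fetcher=fetch, workers=4):
    """Piece 3: download several pages concurrently, keeping input order.

    `fetcher` is a hypothetical hook so a fake downloader can be swapped in.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetcher, urls))
    return [{"url": u, "title": extract_title(p)} for u, p in zip(urls, pages)]


def to_csv(rows):
    """Piece 4: export the results (CSV here; JSON would look similar)."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_csv(scrape(list_of_urls))` would download the pages in parallel and return CSV text. Even this small version has no retries, throttling, or deployment story, which is the kind of plumbing a framework like Scrapy is meant to take off your hands.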

Hi Rizvi, thank you very much for responding to the query from Ankit (my colleague). The issue is not in extracting text from a PDF but in extracting the relevant information about the structure of the PDF (tables etc.). In other words, we have found no way to identify the data as tabular or to recover its structure in a PDF document. What we are trying to do is extract specific information (for example, the data in a specific column of a table in a PDF document). That's where most of the open source libraries falter. The reason appears to lie in the way the PDF has been encoded. I hope the query is clear. In case you need additional information, please let me know. Any help in this regard will be highly appreciated. Primarily, we are looking for Python APIs. Even if open source Java libraries can do the same, we can invoke them from Python code.

Great article, but I'm a little surprised it didn't touch on the challenges of using Scrapy when trying to scrape JavaScript-heavy websites. Most of the sites that I work with now also require using Splash to render the JavaScript. As such, I've also started looking at the Selenium and WebDriver option. At first, I tried very hard to limit myself to only Scrapy and Splash, but after a month working on a complicated site, I really wish I had changed approaches much earlier. I've done more in a few days with Selenium using the page object pattern than in weeks of Scrapy and Splash development.

Hey Charles, it is true that with the advent of JavaScript-based front-end frameworks and libraries, it is becoming more difficult to scrape websites. We would have to use Selenium and WebDriver for the parts that require user action, like dismissing a popup or filling in a form. It’s not rare to see Scrapy used in conjunction with Selenium in projects.

Yet we have to remind ourselves that this is not the problem Scrapy is meant to solve. You could argue that web scraping is a domain of its own with subdomains, one of which is dealing with dynamic, JavaScript-heavy websites. The goal of this article was to get a beginner started with web scraping, especially with Scrapy.

