Options for HTML scraping? [closed]

"Closed 8 years ago. This question needs to be more focused. It is not currently accepting answers.

I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm also interested in hearing about options in other languages.
The story so far:

Python

Beautiful Soup
lxml
HTQL
Scrapy
Mechanize

Ruby

Nokogiri
Hpricot
Mechanize
scrAPI
scRUBYt!
wombat
Watir

.NET

Html Agility Pack
WatiN

Perl

WWW::Mechanize
Web-Scraper

Java

Tag Soup
HtmlUnit
Web-Harvest
jARVEST
jsoup
Jericho HTML Parser

JavaScript

request
cheerio
artoo
node-horseman
phantomjs

PHP

Goutte
htmlSQL
PHP Simple HTML DOM Parser
PHP Scraping with CURL
ScarletsQuery

Go

goquery
Dataflow kit

Most of them

Screen-Scraper"

Asked by: Guest | Views: 306
Total answers/comments: 3
Guest [Entry]

The Ruby world's equivalent to Beautiful Soup is why_the_lucky_stiff's Hpricot.
Guest [Entry]

"In the .NET world, I recommend the HTML Agility Pack. It is not nearly as simple as some of the above options (like htmlSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.

http://www.codeplex.com/htmlagilitypack"
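
To illustrate the two access styles mentioned above (an XPath query versus iterating over nodes), here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. It is not the HTML Agility Pack API, and it assumes the markup is already well formed, which is exactly the repair step a tolerant parser would handle for you. The sample markup and variable names are made up for illustration, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; assumed to be well formed already.
SAMPLE = """<html><body>
<div class="post"><a href="/a">First</a></div>
<div class="post"><a href="/b">Second</a></div>
<div class="ad"><a href="/x">Sponsored</a></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Style 1: an XPath-like query for links inside "post" divs.
xpath_links = [a.get("href") for a in root.findall(".//div[@class='post']/a")]

# Style 2: walk the nodes and filter by hand.
iterated_links = []
for div in root.iter("div"):
    if div.get("class") == "post":
        for a in div.iter("a"):
            iterated_links.append(a.get("href"))

print(xpath_links)
print(iterated_links)
```

Both styles pick out the same links here; the query form is shorter, while the loop is easier to extend with arbitrary filtering logic.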
Guest [Entry]

BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping, and I wish I had known about BeautifulSoup when I started. It's like the DOM with a lot more useful options, and it is a lot more Pythonic. If you want to try Ruby, there is a port called RubyfulSoup, but it hasn't been updated in a while.
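
For a rough feel of what scraping messy HTML involves, here is a small sketch that uses only Python's standard-library html.parser (one of the parsers Beautiful Soup can sit on top of). It is a stand-in, not Beautiful Soup's own API; the LinkCollector class and the sample markup are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> it sees."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical messy input: the <li> and <p> tags are never closed.
MESSY = '<ul><li><a href="/one">One</a><li><p><a href="/two">Two </a></ul>'

collector = LinkCollector()
collector.feed(MESSY)
collector.close()
print(collector.links)
```

The event-driven parser does not complain about the unclosed tags, but all the bookkeeping is manual. Libraries such as Beautiful Soup build a navigable tree on top of this kind of parser, which is where the "more useful options" come from.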