|
UnixReview.com
June 2007
Regular Expressions: Python's Mechanization
by Cameron Laird and Kathryn Soraiz
Much of what we do for our day-time employer, Phaseit, is automation:
Automation of operations on network devices, automation of test
sequences, customized emailings, automation of work-flow in
large governmental agencies, and so on. We recently
"scraped
the Web" using Python in a way
that's likely to interest readers.
Complications
Anything repetitive you do with a computer should trigger the
reaction, "How can I mechanize this, so I don't have to
type/click/select/... the same thing over and over?" Retrieval
of Web pages is a challenge that's yielded a wealth of
automations: bookmarks, command-line tools like lynx and wget, proprietary
scripting applications, capture-and-playback tricks,
browers plugins, and much more.
Part of the reason for this diversity is that the Web is
successful enough to have become complicated. The end-user
experience to which pages are targeted is affected by cookies,
JavaScript interpretation, Referer checks,
several forms of authentication, robots.txt directives, proxying, redirects, browser history, and more.
How do tools balance the simplicity that makes common cases
easy, with the flexibility to control all the details of
cookies and proxies and all the rest? One successful
organizational principle is object orientation (OO).<>
|