PythonLanguage implementations of web link lister. (See WebLinkListExample.) Here's an implementation that lists the hrefs from a web document using a HyperTextMarkupLanguage parser. It uses standard modules HTMLParser, urllib, and sys. No special setup is required after Python is installed. The implementation is pretty idiot-proof and fails gracefully when an error is encountered. 'linklister1.py' from HTMLParser import HTMLParser class Link''''''Parser(HTMLParser): def reset(self): HTMLParser.reset(self) self.hrefs = [] def handle_starttag(self, tag, attrs): href = [v for k,v in attrs if k.lower() == 'href'] if href: self.hrefs.extend(href) def listLinks(url): import urllib print '*** links in', url, '***' parser = Link''''''Parser() try: parser.feed(urllib.urlopen(url).read()) except Exception, e: print '%s: %s\n' % (e, url) return parser.close() for href in parser.hrefs: print href if not parser.hrefs: print 'None!' print if __name__ == '__main__': import sys if len(sys.argv) > 1: for url in sys.argv[1:]: listLinks(url) else: print 'usage: linklister1 url [url2 ...]' The program can parse both HTML and XHTML. It retrieves the values of all href attributes it finds in all the start tags. The Link''''''Parser.handle_starttag method can be modified to process only the tags of interest, e.g., and but not and . Results from www.google.com on 19-Apr-2004 *** links in http://www.google.com/ *** /imghp?hl=en&tab=wi&ie=UTF-8 /grphp?hl=en&tab=wg&ie=UTF-8 /nwshp?hl=en&tab=wn&ie=UTF-8 /froogle?hl=en&tab=wf&ie=UTF-8 /froogle?hl=en&tab=wf&ie=UTF-8 /options/index.html /advanced_search?hl=en /preferences?hl=en /language_tools?hl=en /ads/ /services/ /about.html ---- For something completely different, here's an implementation that lists ''all'' the hrefs using RegularExpression''''''s. It uses standard modules urllib, re, and sys. No special setup is required after Python is installed. The implementation is pretty idiot-proof and fails gracefully when an error is encountered. Tests with Python's timeit module show this regexp implementation is about 22% faster than the HTMLParser implementation on a local html file with 59 hrefs, using Python 2.3 on MacOs 9. 'linklister2.py' def listHrefs(url): import urllib, re lookFor = r'href\s*=\s*\"([^\"]*)|href\s*=\s*\'([^\']*)|href\s*=\s*([^\s\"\'>]+)' clookFor = re.compile(lookFor, re.I) print '*** hrefs in', url, '***' try: document = urllib.urlopen(url).read() except Exception, e: print '%s: %s\n' % (e, url) return hrefs = clookFor.findall(document) for tup in hrefs: for href in tup: if href: print href if not hrefs: print 'None!' print if __name__ == '__main__': import sys if len(sys.argv) > 1: for url in sys.argv[1:]: listHrefs(url) else: print 'usage: linklister2 url [url2 ...]' The program can search any document until an EOF marker is encountered. If there happens to exist 'href="blah"' in the text of a web document, this regexp implementation finds it in addition to the href attributes in the tags. Results from www.google.com on 19-Apr-2004 *** hrefs in http://www.google.com/ *** /imghp?hl=en&tab=wi&ie=UTF-8 /grphp?hl=en&tab=wg&ie=UTF-8 /nwshp?hl=en&tab=wn&ie=UTF-8 /froogle?hl=en&tab=wf&ie=UTF-8 /froogle?hl=en&tab=wf&ie=UTF-8 /options/index.html /advanced_search?hl=en /preferences?hl=en /language_tools?hl=en /ads/ /services/ /about.html ---- Contributor(s): ElizabethWiethoff ---- CategoryPython