Short:        V1.02 Extract URL's from any file+sort++
Author:       frans@xfilesystem.freeserve.co.uk (francis swift)
Uploader:     frans xfilesystem freeserve co uk (francis swift)
Type:         comm/www
Replaces:     urlx.lha
Architecture: m68k-amigaos
Url:          www.xfilesystem.freeserve.co.uk

Some quick'n'nasty hacks, but I've included the source for you to look
at, especially as urlx uses btree routines and there aren't that many
simple examples of using btrees.  The btree routines used are by
Christopher R. Hertel and are available in full on Aminet as
BinaryTrees.lzh in dev/c.

V1.02
-----
Some bugfixes/improvements in scanv, plus a new template option in
urlx, for which I've included an example template file for one
particular version of Voyager.  Use something like

  urlx -p -a -u -t temp_voyager infile Bookmarks.html

to get an HTML bookmarks file.

V1.01
-----
Added functionality to scanv so that it can be used instead of treecat,
for the Voyager cache only.  This eliminates some of the bogus URLs
that were thrown up by the previous method (below) using treecat|urlx.
The new method for scanning the Voyager cache (from sh/pdksh) is e.g.

  scanv -c dh0:Voyager/cache | urlx -p -u - outfile

which uses the new -c flag to cat (output) the contents of each file,
which are then piped through urlx for processing.  Of course, treecat
is still needed for other caches, e.g. AWeb and Netscape.

urlx
----
This program searches a file for URLs (http:// etc.) and prints them or
writes them to a file.  Internally it stores them in a btree so that
duplicates can be eliminated and, optionally, the output sorted.
There are various options:

 -s  selects a simple alphabetic sort for the output
 -u  selects a special URL sort that should give better grouping of
     similar site names (basically it sorts the first URL element in
     groups, backwards) - see the C sketch after the urlv section below
 -h  selects HTML output format for making quick bookmark files,
     instead of the default plain text output
 -t  use a template file for output formatting
 -p  retain parameters after URLs; by default these are ignored
 -a  allow accented characters in URLs (i.e. characters > 127)
 -.  select just URLs with the given extension; for example, to show
     only .jpg URLs you would use -.jpg, and for .html you would use
     -.htm (which matches both .htm and .html)
 -i  a special file selection option which tries to intelligently
     select only URLs that are likely to be HTML pages, both by using
     the extension and by examining the path

Basically there are lots of options, but you'll probably just end up
using

  urlx -u infile outfile

which uses the special URL sort, or

  urlx -u -h infile outfile.html

for making a bookmark file.  In both of the above examples you might
want to add -p to retain parameters (the bits after the question marks,
e.g. http://yes.or.no?wintel=crap).

treecat
-------
This is just a quick hack to let shell (sh/pdksh) users grab URLs from
a complete directory tree.  urlx accepts a single dash as meaning input
comes from stdin, so you can use something like

  treecat cachedirectorypath | urlx -u - outfilename

to produce a file containing every URL in every file in your cache.
You can use this on any browser cache tree.

scanv
-----
This is used specifically to pick out the URLs from the headers of the
files in a Voyager cache.  This is just the URL of the file itself; by
default the contents are not examined.
NEW (1.01): -c flag to cat (output) the contents of each file for
piping to urlx.

urlv
----
This is used specifically to grab URLs from a Voyager history file,
usually called URL-History.1.
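For the curious, here is a minimal sketch (not taken from the urlx
source) of the idea behind the -u sort: the hostname's dot-separated
groups are compared in reverse order, so www.foo.org and ftp.foo.org
end up next to each other.  It uses qsort() on a small array rather
than the btree that urlx really uses, and the names hostkey() and
urlcmp() are purely illustrative.

/* urlsort.c - sketch of a "reversed host groups" URL comparison. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build a sort key from the hostname of `url` with its groups reversed,
 * e.g. "http://www.foo.org/x.html" -> "org.foo.www". */
static void hostkey(const char *url, char *key, size_t keylen)
{
    const char *p = strstr(url, "://");
    char host[256];
    size_t n, klen;

    p = (p != NULL) ? p + 3 : url;
    n = strcspn(p, "/");                 /* hostname ends at first '/' */
    if (n >= sizeof(host))
        n = sizeof(host) - 1;
    memcpy(host, p, n);
    host[n] = '\0';

    key[0] = '\0';
    while (n > 0) {                      /* walk the groups right to left */
        size_t start = n, len;
        while (start > 0 && host[start - 1] != '.')
            start--;
        len = n - start;
        klen = strlen(key);
        if (klen > 0 && klen + 1 < keylen) {
            key[klen++] = '.';
            key[klen] = '\0';
        }
        if (klen + len < keylen) {       /* append just this group */
            memcpy(key + klen, host + start, len);
            key[klen + len] = '\0';
        }
        n = (start > 0) ? start - 1 : 0;
    }
}

/* qsort() callback comparing two URLs by their reversed-host keys. */
static int urlcmp(const void *a, const void *b)
{
    char ka[512], kb[512];

    hostkey(*(const char * const *)a, ka, sizeof(ka));
    hostkey(*(const char * const *)b, kb, sizeof(kb));
    return strcmp(ka, kb);
}

int main(void)
{
    const char *urls[] = {
        "http://www.foo.org/index.html",
        "http://www.bar.com/",
        "http://ftp.foo.org/pub/file.lha"
    };
    size_t i, n = sizeof(urls) / sizeof(urls[0]);

    qsort((void *)urls, n, sizeof(urls[0]), urlcmp);
    for (i = 0; i < n; i++)
        printf("%s\n", urls[i]);
    return 0;
}

Compiled and run, this prints the bar.com URL first and then the two
foo.org URLs together, which is the grouping effect the -u option is
after.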
urla
----
This is used specifically to grab URLs from an AWeb cache index file,
usually called AWCR.

stricmp_test
------------
Just a quick test prog to see which order the compiler (the libc,
really) sorts strings in stricmp() calls.  Different compilers use
different orders :-(
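For what it's worth, here is a hypothetical reconstruction (not the
actual stricmp_test source) of the kind of check such a program might
make.  It assumes a stricmp() routine is available, as it is with most
Amiga compilers; on POSIX systems the same routine is called
strcasecmp(), which the #define below papers over.

/* stricmp_order.c - does stricmp() fold to upper or to lower case?
 * '_' (0x5F) sits between 'Z' (0x5A) and 'a' (0x61) in ASCII, so the
 * sign of stricmp("_", "a") reveals which way the library folds, and
 * hence which order strings containing such characters will sort in. */
#include <stdio.h>
#include <string.h>

#if defined(__unix__) || defined(__APPLE__)
#include <strings.h>             /* POSIX name for the routine */
#define stricmp strcasecmp
#endif

int main(void)
{
    int r = stricmp("_", "a");

    if (r < 0)
        printf("folds to lower case: '_' sorts before 'a'\n");
    else if (r > 0)
        printf("folds to upper case: '_' sorts after 'a'\n");
    else
        printf("unexpected: '_' compares equal to 'a'\n");
    return 0;
}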