Metalworking (rec.crafts.metalworking) Discuss various aspects of working with metal, such as machining, welding, metal joining, screwing, casting, hardening/tempering, blacksmithing/forging, spinning and hammer work, sheet metal work.

#1
Posted to rec.crafts.metalworking
James Waldby
 
Default Archiving an author's postings [OT]

"DoN. Nichols" wrote:
According to David Merrill :
Unless I'm missing something, doing an advanced Google Groups search on
author, "Robert Bastow" in "rec.crafts.metalworking" returns 2860 'threads'
containing one or more messages from Teenut buried among numerous other
messages. From these entire threads one would have to copy Teenut's
individual messages and paste them into a text file; certainly possible, but
a laborious process.

Can anyone identify a more efficient way; possibly some of you in the Linux
world, or are your Web/Usenet readers as insulated from scripting tools as
seems to be the current case in the Windows world?


For some reason, I get 3010 or 3020 rather than 2860 threads, depending on
quotes and spaces in search terms. E.g., there are 3010 shown at (1 line url)
http://groups.google.com/groups/sear...obert%20bastow

[...] "lynx" is a text-only browser, which can work well for the
task, and can be coupled to shell scripts to do quite complex things.


True enough, although I prefer wget for automated webpage downloading,
in general. I presume lynx will save pages with html stripped out?
I'd see that as an advantage in an application like this.

"wget" can download entire trees of web pages, or individual
files, so a combination of lynx to find things, a shell script to run
it, and wget to download to files could do it nicely.


DoN, perhaps you could try wget on the following url and the one below.
http://groups.google.com/group/rec.c...ab0c3e00380a5f
From here, I get ERROR 403: Forbidden, although in a browser they bring
up R.B. pages ok. Maybe a cookie problem?

David, if you save the Google search page in file t, the following
all-on-1-line command will generate a list of individual-message urls
in u from the thread urls in t:
grep "/browse_thread/" t | sed -e "s|^.*/thread/[0-9a-f]*/|http://groups.google.com/group/rec.crafts.metalworking/msg/|" -e "s/?lnk.*$//" > u
(Install the cygwin package to get grep, sed, and bash if using Windows.)

For example, the first grepped line is
<font size="+0"><a href="/group/rec.crafts.metalworking/browse_thread/thread/5e5973b836951947/bc0bf49e00214956?lnk=st&q=group%3Arec.crafts.metalworking+author%3Arobert+author%3Abastow&rnum=1#bc0bf49e00214956">Bolting down milling machine??</a></font>
and sed converts it to
http://groups.google.com/group/rec.c...0bf49e00214956
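
For anyone who wants to reuse the conversion, the grep/sed pipeline above can be wrapped in a small shell function. This is only a sketch: the name thread_to_msg_urls is made up for illustration, and it assumes the saved search page still uses the /browse_thread/thread/<thread-id>/<message-id>?lnk=... link layout shown in the example line.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration only).
# Reads saved search-result HTML on stdin and prints one
# per-message URL for every line containing a thread link.
thread_to_msg_urls() {
    grep '/browse_thread/' |
    sed -e 's|^.*/thread/[0-9a-f]*/|http://groups.google.com/group/rec.crafts.metalworking/msg/|' \
        -e 's/?lnk.*$//'
}
```

Run as `thread_to_msg_urls < t > u`, this should behave like the one-line command. Single quotes are used so the shell never touches the `$` in the sed expression.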

However, for Teenut's postings, the way that *I* would go for it
is to download the relevant years from the archives at the site which
holds the official (and long un-updated) FAQs for the newsgroup. At the

[snip details]

-jiw
#2
Posted to rec.crafts.metalworking
Dave Hinz
 

On Tue, 16 May 2006 11:21:33 -0600, James Waldby wrote:
True enough, although I prefer wget for automated webpage downloading,
in general. I presume lynx will save pages with html stripped out?
I'd see that as an advantage in an application like this.


lynx -dump I think it is. Gives you a plain text version of it without
the HTML. I use wget for some things, lynx for others, depends on how I
want to parse it once I have it.

DoN, perhaps you could try wget on the following url and the one below.
http://groups.google.com/group/rec.c...ab0c3e00380a5f
From here, I get ERROR 403: Forbidden, although in a browser they bring
up R.B. pages ok. Maybe a cookie problem?


I know you can do cookies with wget but I've never done it. It's in TFM
though.
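
For reference, a minimal sketch of the cookie route. It assumes you have already exported the browser's cookies to a Netscape-format cookies.txt file; the function name and the WGET override (handy for a dry run with WGET=echo) are made up for illustration. Whether cookies are actually what Google objects to here is untested.

```shell
#!/bin/sh
# Hypothetical helper: fetch one URL, sending cookies from a saved file.
# $1 = cookie file (Netscape cookies.txt format)
# $2 = URL to fetch
# $3 = file to write the page to
fetch_with_cookies() {
    # --load-cookies is a standard wget option; WGET is an
    # illustrative override so the command can be inspected first.
    "${WGET:-wget}" --load-cookies "$1" --output-document "$3" "$2"
}
```

Setting WGET=echo before the call prints the wget arguments instead of running the download.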


#3
Posted to rec.crafts.metalworking
DoN. Nichols
 

According to James Waldby :
"DoN. Nichols" wrote:


[ ... ]

[...] "lynx" is a text-only browser, which can work well for the
task, and can be coupled to shell scripts to do quite complex things.


True enough, although I prefer wget for automated webpage downloading,
in general. I presume lynx will save pages with html stripped out?
I'd see that as an advantage in an application like this.


Yes -- it can. See this option:

======================================================================
-dump
dumps the formatted output of the default document or
one specified on the command line to standard out.
This can be used in the following way:

lynx -dump http://www.crl.com/~subir/lynx.html

======================================================================
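
Since -dump writes to standard out, saving the stripped text is just a redirect. A minimal sketch, with the function name and the LYNX override invented for illustration:

```shell
#!/bin/sh
# Hypothetical helper: save a plain-text rendering of a page.
# $1 = URL or local HTML file
# $2 = text file to write
dump_text() {
    # lynx -dump prints the formatted document to standard out,
    # so the redirect captures it with the HTML stripped.
    "${LYNX:-lynx}" -dump "$1" > "$2"
}
```

For example, `dump_text page.html page.txt` would leave the rendered text in page.txt.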

"wget" can download entire trees of web pages, or individual
files, so a combination of lynx to find things, a shell script to run
it, and wget to download to files could do it nicely.
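
The "shell script to run it" part might look roughly like the loop below. It is a sketch under assumptions: the input is a file of URLs, one per line; each page is saved under the last path component of its URL; and the names fetch_all, WGET and DELAY are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: download every URL listed in a file.
# $1 = file containing one URL per line
# $2 = directory to save the pages in
fetch_all() {
    mkdir -p "$2" || return 1
    while IFS= read -r url; do
        [ -n "$url" ] || continue      # skip blank lines
        id=${url##*/}                  # last path component of the URL
        "${WGET:-wget}" --output-document "$2/$id.html" "$url"
        sleep "${DELAY:-2}"            # pause between requests
    done < "$1"
}
```

Called as `fetch_all u pages`, it would try to save each listed URL as pages/<last-component>.html, pausing two seconds between requests unless DELAY says otherwise.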


DoN, perhaps you could try wget on the following url and the one below.
http://groups.google.com/group/rec.c...ab0c3e00380a5f
From here, I get ERROR 403: Forbidden, although in a browser they bring
up R.B. pages ok. Maybe a cookie problem?


I get the same response -- until I add the following option:

--user-agent="Mozilla/4.0"

which causes it to identify itself to Google's server as a version of
Mozilla. Google does not like wget for whatever reasons. :-)
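
Spelled out as a complete command, that might look like the sketch below. The function name and the WGET override are invented for illustration; only the --user-agent value comes from the test described above.

```shell
#!/bin/sh
# Hypothetical helper: fetch one URL while identifying as Mozilla.
# $1 = URL to fetch
# $2 = file to write the page to
fetch_as_browser() {
    # --user-agent replaces wget's default identification string.
    "${WGET:-wget}" --user-agent="Mozilla/4.0" --output-document="$2" "$1"
}
```

This only changes how wget announces itself; it does nothing about the javascript problem.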

However, what that brought up was some 30K of javascript, which
wget can't process, so it won't get the rest of what you were after.
Processing that through the -dump option of lynx shows that it has some
30+ links in it, which might have been what you were after.
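
If lynx is not at hand, a rough way to list the links in a saved page is plain tr and sed. This is a swapped-in technique, not what lynx does internally: it only sees double-quoted href attributes and knows nothing about links built by javascript. The function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the href value of every tag on stdin.
# Only handles href="..." with double quotes.
extract_links() {
    # Put each tag on its own line, then print just the href value.
    tr '>' '\n' |
    sed -n 's/.*href="\([^"]*\)".*/\1/p'
}
```

Usage would be along the lines of `extract_links < page.html > links.txt`.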

[ ... ]

However, for Teenut's postings, the way that *I* would go for it
is to download the relevant years from the archives at the site which
holds the official (and long un-updated) FAQs for the newsgroup. At the

[snip details]


It looks as though Scott Logan has posted the pointer to the
collection again -- elsewhere in this thread.

Enjoy,
DoN.

--
Email: | Voice (all times): (703) 938-4564
(too) near Washington D.C. | http://www.d-and-d.com/dnichols/DoN.html
--- Black Holes are where God is dividing by zero ---