DIYbanter (https://www.diybanter.com/) - Metalworking
Thread: Copy a website (https://www.diybanter.com/metalworking/259089-copy-website.html)

Gunner Asch[_4_] August 31st 08 02:11 AM

Copy a website
 
www.httrack.com/

Been using it for years. Works on MOST websites, though not all.
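For reference, a basic HTTrack command-line invocation looks like the
following; example.com, the filter pattern, and the output directory are
placeholders, and WinHTTrack wraps the same engine in a GUI:

    httrack "http://example.com/" -O /tmp/example-mirror "+*.example.com/*" -v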

Gunner

"Confiscating wealth from those who have earned it, inherited it,
or got lucky is never going to help 'the poor.' Poverty isn't
caused by some people having more money than others, just as obesity
isn't caused by McDonald's serving super-sized orders of French fries
Poverty, like obesity, is caused by the life choices that dictate
results." - John Tucci,

Ignoramus4791 August 31st 08 03:10 AM

Copy a website
 
On 2008-08-31, Gunner Asch wrote:
> www.httrack.com/
>
> Been using it for years. Works on MOST websites, though not all.


If you try downloading my website algebra.com, you will get into an
infinite recursion through millions of pages. That's why I prevent
most such bots from accessing my site. This would work only on very
simple sites.
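A robots.txt at the site root is the polite-bot version of that
blocking; HTTrack honors it by default. A minimal sketch (the paths
and user-agent names are illustrative):

    User-agent: HTTrack
    Disallow: /

    User-agent: *
    Disallow: /cgi-bin/

Non-compliant bots simply ignore this file, which is why the
server-side detection discussed later in the thread is needed as a
backstop.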

--
Due to extreme spam originating from Google Groups, and their inattention
to spammers, I and many others block all articles originating
from Google Groups. If you want your postings to be seen by
more readers you will need to find a different means of
posting on Usenet.
http://improve-usenet.org/

Martin H. Eastburn August 31st 08 05:07 AM

Copy a website
 
Some bots, as you say, can be limited to only 1, 2, or 3 levels deep.
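Both HTTrack and wget let the user cap that depth explicitly; two
illustrative invocations against a placeholder site:

    httrack "http://example.com/" -O ./mirror -r3      (mirror at most 3 link-levels deep)
    wget -r -l 2 --no-parent http://example.com/docs/  (recurse 2 levels, stay under /docs/)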

Martin
Martin H. Eastburn
@ home at Lions' Lair with our computer lionslair at consolidated dot net
TSRA, Endowed; NRA LOH & Patron Member, Golden Eagle, Patriot's Medal.
NRA Second Amendment Task Force Charter Founder
IHMSA and NRA Metallic Silhouette maker & member.
http://lufkinced.com/


Ignoramus4791 wrote:
> On 2008-08-31, Gunner Asch wrote:
>> www.httrack.com/
>>
>> Been using it for years. Works on MOST websites, though not all.
>
> If you try downloading my website algebra.com, you will get into an
> infinite recursion through millions of pages. That's why I prevent
> most such bots from accessing my site. This would work only on very
> simple sites.




Cydrome Leader August 31st 08 08:16 AM

Copy a website
 
Ignoramus4791 wrote:
> On 2008-08-31, Gunner Asch wrote:
>> www.httrack.com/
>>
>> Been using it for years. Works on MOST websites, though not all.
>
> If you try downloading my website algebra.com, you will get into an
> infinite recursion through millions of pages. That's why I prevent
> most such bots from accessing my site. This would work only on very
> simple sites.


You must be very smart to have such a complex and sophisticated website.



Gunner Asch[_4_] August 31st 08 08:19 AM

Copy a website
 
On Sat, 30 Aug 2008 21:10:32 -0500, Ignoramus4791 wrote:

> On 2008-08-31, Gunner Asch wrote:
>> www.httrack.com/
>>
>> Been using it for years. Works on MOST websites, though not all.
>
> If you try downloading my website algebra.com, you will get into an
> infinite recursion through millions of pages. That's why I prevent
> most such bots from accessing my site. This would work only on very
> simple sites.



Yours was one of the sites it doesn't work well on...chuckle

Been there, tried that...

But it works quite well on most others.

Gunner

"Confiscating wealth from those who have earned it, inherited it,
or got lucky is never going to help 'the poor.' Poverty isn't
caused by some people having more money than others, just as obesity
isn't caused by McDonald's serving super-sized orders of French fries
Poverty, like obesity, is caused by the life choices that dictate
results." - John Tucci,

[email protected] August 31st 08 09:27 PM

Copy a website
 

> If you try downloading my website algebra.com, you will get into an
> infinite recursion through millions of pages. That's why I prevent
> most such bots from accessing my site. This would work only on very
> simple sites.


How does your web server differentiate between a bot and a human user
making http requests?

Regards,

Robin

Lloyd E. Sponenburgh[_3_] August 31st 08 11:27 PM

Copy a website
 
fired this volley in news:b6e6ea8f-a3fe-4f46-:


>> If you try downloading my website algebra.com, you will get into an
>> infinite recursion through millions of pages. That's why I prevent
>> most such bots from accessing my site. This would work only on very
>> simple sites.
>
> How does your web server differentiate between a bot and a human user
> making http requests?


Duh! It doesn't. The site has links back to the place where the link
began. It wouldn't appear recursive to a human user, because that
person would choose what he/she viewed. The spider can't tell, and
ends up in recursions it can only abort by "counting out" repeats.

LLoyd
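A minimal sketch of that "counting out" in Python, assuming nothing
about any real crawler's internals: the spider keeps a set of URLs it
has already queued and skips repeats.

    from collections import deque
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen
    import re

    def crawl(start_url, max_pages=100):
        host = urlparse(start_url).netloc
        seen = {start_url}                 # every URL ever queued
        queue = deque([start_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                   # skip pages that fail to fetch
            for href in re.findall(r'href="([^"#]+)', html):
                link = urljoin(url, href)
                # The visited set is what breaks A -> B -> A loops:
                # stay on one host and never queue the same URL twice.
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen

The catch, per Ignoramus's point above, is that a visited set only
breaks loops among existing pages; a dynamic site can keep minting
brand-new distinct URLs forever.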

Mark Rand August 31st 08 11:40 PM

Copy a website
 
On Sat, 30 Aug 2008 18:11:21 -0700, Gunner Asch wrote:

> www.httrack.com/
>
> Been using it for years. Works on MOST websites, though not all.
>
> Gunner



I tend to just use wget. Helps if you've got *nix for an OS or the Cygwin
utilities for windoze though.


Mark Rand
RTFM
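For reference, a typical wget mirroring invocation (standard GNU wget
flags; example.com is a placeholder):

    wget --mirror --convert-links --adjust-extension --page-requisites \
         --no-parent --wait=1 http://example.com/

--wait=1 keeps the load polite, and --no-parent stops the recursion
from climbing above the starting directory.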

Ignoramus3863 August 31st 08 11:51 PM

Copy a website
 
On 2008-08-31, Lloyd E. Sponenburgh lloydspinsidemindspring.com wrote:
> fired this volley in news:b6e6ea8f-a3fe-4f46-:
>
>>> If you try downloading my website algebra.com, you will get into an
>>> infinite recursion through millions of pages. That's why I prevent
>>> most such bots from accessing my site. This would work only on very
>>> simple sites.
>>
>> How does your web server differentiate between a bot and a human user
>> making http requests?
>
> Duh! It doesn't. The site has links back to the place where the link
> began. It wouldn't appear recursive to a human user, because that
> person would choose what he/she viewed. The spider can't tell, and
> ends up in recursions it can only abort by "counting out" repeats.


I actually have some smarts in the server that can tell a bot from a
human. But httrack is blocked on the spot in any case. I am not
against it, as such, but it will not work on my site.


Richard J Kinch September 1st 08 05:18 AM

Copy a website
 
Lloyd E. Sponenburgh writes:

> The spider can't tell,


For one, "wget" can certainly detect and ignore recursive loops.

Richard J Kinch September 1st 08 05:19 AM

Copy a website
 
Ignoramus3863 writes:

> I actually have some smarts in the server that can tell a bot from a
> human.


Not a bot attempting to look human. Just bots that advertise their
botness, by honest design or flawed hacking.

Ignoramus3863 September 1st 08 05:24 AM

Copy a website
 
On 2008-09-01, Richard J Kinch wrote:
> Ignoramus3863 writes:
>
>> I actually have some smarts in the server that can tell a bot from a
>> human.
>
> Not a bot attempting to look human. Just bots that advertise their
> botness, by honest design or flawed hacking.


Yes, even if a bot tries to look like a human (i.e., supplying a
Referer and a browser-like User-Agent), I can still detect that it is
a bot.

The way I detect it is that there is a hidden link that humans cannot
see, and cannot click, but bots would follow it. The hidden link is
not permitted by robots.txt, so it catches all non-compliant bots.
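A minimal sketch of that trap, assuming a Python/Flask server; the
route name, the hidden-link markup, and the ban logic are all
illustrative, not algebra.com's actual code:

    from flask import Flask, abort, request

    app = Flask(__name__)
    banned = set()   # in production this would persist (file, DB, or firewall)

    # robots.txt forbids the trap, so compliant bots never request it:
    #     User-agent: *
    #     Disallow: /trap/
    # Each page carries a link no human can see or click, e.g.:
    #     <a href="/trap/do-not-follow" style="display:none"></a>

    @app.before_request
    def block_banned():
        # Every later request from a caught address is refused outright.
        if request.remote_addr in banned:
            abort(403)

    @app.route("/trap/do-not-follow")
    def trap():
        # Only a crawler that ignores robots.txt and blindly follows
        # every href lands here; remember its address and shut it out.
        banned.add(request.remote_addr)
        abort(403)

    @app.route("/")
    def index():
        return ('<p>Welcome.</p>'
                '<a href="/trap/do-not-follow" style="display:none"></a>')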

Richard J Kinch September 1st 08 05:35 AM

Copy a website
 
Ignoramus3863 writes:

> The way I detect it is that there is a hidden link that humans cannot
> see, and cannot click, but bots would follow it.


Yes, that would be difficult to defeat.

Ignoramus3863 September 1st 08 05:40 AM

Copy a website
 
On 2008-09-01, Richard J Kinch wrote:
> Ignoramus3863 writes:
>
>> The way I detect it is that there is a hidden link that humans cannot
>> see, and cannot click, but bots would follow it.
>
> Yes, that would be difficult to defeat.


I had a lot of trouble with httrack and other bots like it. The
people who run them usually do not mean anything bad; they just do
not realize that they should not run them against dynamic sites like
mine. They may not even realize that my site is dynamic, because it
tries not to look dynamic (search engine friendly and all).

I spent a very long time trying to 1) make a website which hopefully
does not lead bots into too many infinite crawls, and 2) detect and
stop bad bots early enough. But I still get problems from time to time.
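One common form of "detect and stop bad bots early" is a per-IP rate
check, since no human requests dozens of pages in a few seconds. A
sketch with made-up thresholds, not the server's actual logic:

    import time
    from collections import defaultdict, deque

    WINDOW = 10.0     # seconds of history to keep (illustrative)
    MAX_HITS = 30     # more requests than this per window looks automated

    hits = defaultdict(deque)   # client IP -> timestamps of recent requests

    def looks_automated(ip):
        """Record one request from ip; return True if its recent rate
        exceeds what a human reader plausibly generates."""
        now = time.monotonic()
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:   # drop requests outside the window
            q.popleft()
        return len(q) > MAX_HITS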


Gunner Asch[_4_] September 1st 08 12:59 PM

Copy a website
 
On Sun, 31 Aug 2008 17:51:00 -0500, Ignoramus3863 wrote:

> On 2008-08-31, Lloyd E. Sponenburgh lloydspinsidemindspring.com wrote:
>> fired this volley in news:b6e6ea8f-a3fe-4f46-:
>>
>>>> If you try downloading my website algebra.com, you will get into an
>>>> infinite recursion through millions of pages. That's why I prevent
>>>> most such bots from accessing my site. This would work only on very
>>>> simple sites.
>>>
>>> How does your web server differentiate between a bot and a human user
>>> making http requests?
>>
>> Duh! It doesn't. The site has links back to the place where the link
>> began. It wouldn't appear recursive to a human user, because that
>> person would choose what he/she viewed. The spider can't tell, and
>> ends up in recursions it can only abort by "counting out" repeats.
>
> I actually have some smarts in the server that can tell a bot from a
> human. But httrack is blocked on the spot in any case. I am not
> against it, as such, but it will not work on my site.



Why?

Gunner

"Confiscating wealth from those who have earned it, inherited it,
or got lucky is never going to help 'the poor.' Poverty isn't
caused by some people having more money than others, just as obesity
isn't caused by McDonald's serving super-sized orders of French fries
Poverty, like obesity, is caused by the life choices that dictate
results." - John Tucci,

Gunner Asch[_4_] September 1st 08 01:01 PM

Copy a website
 
On Sun, 31 Aug 2008 23:40:35 -0500, Ignoramus3863 wrote:

> On 2008-09-01, Richard J Kinch wrote:
>> Ignoramus3863 writes:
>>
>>> The way I detect it is that there is a hidden link that humans cannot
>>> see, and cannot click, but bots would follow it.
>>
>> Yes, that would be difficult to defeat.
>
> I had a lot of trouble with httrack and other bots like it. The
> people who run them usually do not mean anything bad; they just do
> not realize that they should not run them against dynamic sites like
> mine. They may not even realize that my site is dynamic, because it
> tries not to look dynamic (search engine friendly and all).
>
> I spent a very long time trying to 1) make a website which hopefully
> does not lead bots into too many infinite crawls, and 2) detect and
> stop bad bots early enough. But I still get problems from time to time.



What's wrong with bots harvesting your manuals?

Frankly...on dialup...I don't have the time to hit each and every manual
and wait for a download to start and finish.

On sites such as yours, I run the program and go to bed.


Gunner

"Confiscating wealth from those who have earned it, inherited it,
or got lucky is never going to help 'the poor.' Poverty isn't
caused by some people having more money than others, just as obesity
isn't caused by McDonald's serving super-sized orders of French fries
Poverty, like obesity, is caused by the life choices that dictate
results." - John Tucci,

Ignoramus32074 September 1st 08 01:12 PM

Copy a website
 
On 2008-09-01, Gunner Asch wrote:
> On Sun, 31 Aug 2008 23:40:35 -0500, Ignoramus3863 wrote:
>
>> On 2008-09-01, Richard J Kinch wrote:
>>> Ignoramus3863 writes:
>>>
>>>> The way I detect it is that there is a hidden link that humans cannot
>>>> see, and cannot click, but bots would follow it.
>>>
>>> Yes, that would be difficult to defeat.
>>
>> I had a lot of trouble with httrack and other bots like it. The
>> people who run them usually do not mean anything bad; they just do
>> not realize that they should not run them against dynamic sites like
>> mine. They may not even realize that my site is dynamic, because it
>> tries not to look dynamic (search engine friendly and all).
>>
>> I spent a very long time trying to 1) make a website which hopefully
>> does not lead bots into too many infinite crawls, and 2) detect and
>> stop bad bots early enough. But I still get problems from time to time.
>
> What's wrong with bots harvesting your manuals?

Nothing.

But if you go to algebra.com, you can accidentally go into an infinite
loop with various scripts.

> Frankly...on dialup...I don't have the time to hit each and every manual
> and wait for a download to start and finish.
>
> On sites such as yours, I run the program and go to bed.


You may need to sleep a lot longer than anticipated.

i





Leon Fisk September 1st 08 07:48 PM

Copy a website
 
On Sun, 31 Aug 2008 23:40:23 +0100, Mark Rand wrote:

> On Sat, 30 Aug 2008 18:11:21 -0700, Gunner Asch wrote:
>
>> www.httrack.com/
>>
>> Been using it for years. Works on MOST websites, though not all.
>>
>> Gunner
>
> I tend to just use wget. Helps if you've got *nix for an OS or the Cygwin
> utilities for windoze though.
>
> Mark Rand
> RTFM


I use wget for this too, when just saving a few pages doesn't
work out so well. There are Windows binaries available, no
need for Cygwin. For example:

http://xoomer.alice.it/hherold/

Wget won't hold your hand, though; it is command-line only, and a
little bit of reading/homework is suggested for it to be really
useful...
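In the spirit of that homework, a restrained first grab might look
like this (placeholder URL; standard GNU wget flags):

    wget -r -l 2 -np -k -p --wait=1 http://example.com/manuals/

-np keeps it from climbing out of /manuals/, -l 2 bounds the recursion
depth, and --wait=1 avoids hammering a small server.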

--
Leon Fisk
Grand Rapids MI/Zone 5b
Remove no.spam for email

