EPICS Re: Tech-talk archives excluded from indexing?

Experimental Physics and Industrial Control System


Subject:	Re: Tech-talk archives excluded from indexing?
From:	John Hammonds <[email protected]>
To:	[email protected]
Date:	Wed, 28 Jul 2010 14:00:21 -0500

Andrew,

I can't remember where I saw this, but one list added an extra layer to the typical replacement of @ with _at_, using _at_SOMEOTHERTEXT_ instead to further confuse crawlers. The need to remove the extra text was then explained on the web page.
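As a rough illustration of that scheme, here is a minimal Python sketch. The function names and the filler string are hypothetical, chosen only for this example; the list John remembers may have worked differently.

```python
# Hypothetical helpers; FILLER is an arbitrary example string.
FILLER = "SOMEOTHERTEXT"


def obfuscate(address: str, filler: str = FILLER) -> str:
    """Replace the first '@' in an address with '_at_<filler>_'."""
    local, sep, domain = address.partition("@")
    if not sep:
        return address  # no '@' present, leave the text alone
    return f"{local}_at_{filler}_{domain}"


def deobfuscate(text: str, filler: str = FILLER) -> str:
    """Undo obfuscate(); this is the step a human reader is told to do."""
    return text.replace(f"_at_{filler}_", "@")
```

A harvester that only knows to turn _at_ back into @ would end up with an address containing the filler text, which is the point of the extra layer.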

John

On 7/28/2010 12:24 PM, Andrew Johnson wrote:

Hi Angus,

On Tuesday 27 July 2010 23:18:49 Angus Gratton wrote:

When I looked in the /robots.txt file I saw that tech-talk is explicitly
excluded from crawler indexing:

Disallow: /epics/core-talk
Disallow: /epics/mantis
Disallow: /epics/tech-talk
Disallow: /epics/wiki

I'm wondering why this is, and if it can possibly be undone?
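For readers who want to check how a well-behaved crawler would interpret those rules, the following Python sketch uses the standard library's urllib.robotparser. The "User-agent: *" line and the example paths are assumptions made for illustration; only the four Disallow lines come from the message above.

```python
from urllib.robotparser import RobotFileParser

# The User-agent line is assumed; the Disallow lines are the ones quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /epics/core-talk
Disallow: /epics/mantis
Disallow: /epics/tech-talk
Disallow: /epics/wiki
"""


def is_crawlable(path: str, agent: str = "*") -> bool:
    """Return True if the rules above allow `agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)
```

Disallow rules are prefix matches, so anything below /epics/tech-talk is excluded, while paths elsewhere on the site stay crawlable.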

It dates back to the days when spam filters were much less effective than they
are nowadays; I was trying to be kind to the users of tech-talk by keeping
their email addresses from being scraped off the website.  I admit an email
harvester might ignore the robots.txt file if they are doing their own
crawling, but at least this prevents them from finding addresses via regular
search engines.  Maybe this doesn't matter as much nowadays, would anyone like
to comment?

The Mhonarc software that I use to maintain the archive generates mailto: URLs
in its HTML output wherever it sees an email address in the message header or
body.  Elsewhere on the EPICS site I replace the '@' sign in mailto: URLs with
'_at_' and I've worked out how to get Mhonarc to do that too, but it doesn't
provide an obvious way to rewrite the message text that the URL decorates,
which will still contain the complete email address.  Hopefully just changing
the mailto: is sufficient to deter the spammers, or this is not necessary any
more.
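The effect described here can be sketched as a standalone post-processing step in Python. This is not how Mhonarc is configured; the regular expression and the function name are assumptions for illustration, and the pattern is deliberately loose.

```python
import re

# Matches href="mailto:local@domain" and captures the pieces around the '@'.
MAILTO_RE = re.compile(
    r'(?P<head>href="mailto:)(?P<local>[^"@]+)@(?P<domain>[^"]+)(?P<tail>")',
    re.IGNORECASE,
)


def rewrite_mailto(html: str) -> str:
    """Replace '@' with '_at_' inside mailto: hrefs only."""
    return MAILTO_RE.sub(r"\g<head>\g<local>_at_\g<domain>\g<tail>", html)
```

Only the href attribute changes. The link text between the anchor tags still carries the complete address, which is the limitation mentioned above.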

I have reprocessed all the tech-talk and core-talk archive files to rewrite
the mailto: URLs as described above; if there are no complaints I will open up
the tech-talk and core-talk archives to the web crawlers in a day or two.

Mantis has been replaced by the bug tracker at Launchpad.net; I'm trying to
get the Wiki replaced with a newer version that I don't have to manage, so
that is likely to move at some point and I'd rather not have Google hit it
until then.

HTH,

- Andrew


--
John Hammonds
Software Services Group
APS Engineering Support Division

Argonne National Laboratory
[email protected]
(630)252-5317

References:: Tech-talk archives excluded from indexing? Angus Gratton; Re: Tech-talk archives excluded from indexing? Andrew Johnson
