
Robots.txt in the Vanilla root is needed!

phreak Vanilla*APP (White Label) & Vanilla*Skins Shop MVP
edited August 2010 in Vanilla 2.0 - 2.8
Hi all,

A robots.txt file is needed in the Vanilla root! It should ship with the download.
It's a must, I think. Any counterarguments?

What do you think the Vanilla robots.txt should contain?
  • VanillaAPP | iOS & Android App for Vanilla - White label app for Vanilla Forums OS
  • VanillaSkins | Plugins, Themes, Graphics and Custom Development for Vanilla

Comments

  • Options
    Todd Chief Product Officer Vanilla Staff
    I don't know off hand. Do you have a suggestion for the basics?
  • Options
    phreak Vanilla*APP (White Label) & Vanilla*Skins Shop MVP
    Hi Todd, sorry for the late reply.

    Hmm, the CMSs I use usually block out their system files. That is, they define a list of folders or files that are then "usually" not crawlable or indexable by search engines.

    This can come in handy if there is a known security issue that isn't fixed yet and script kiddies are searching for the related file to try out their latest milw0rm script.

    Um, but that's just what I've heard; I'm not really a specialist on the use of robots.txt (sorry if my opening statement suggested that ;) ). I don't know whether this is still a real concern.
  • Options
    Linc Detroit Admin
    edited September 2010
    @phreak You misunderstand how robots.txt works. It's merely a request not to crawl content, not a blocking mechanism. There is no security in it whatsoever. I don't see any need for a robots.txt file in the software.
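    To make that point concrete: a compliant crawler merely parses the file and consults it before fetching a URL; nothing on the server enforces the rules. A minimal sketch with Python's stdlib parser (the Disallow paths are just examples):

```python
from urllib.robotparser import RobotFileParser

# robots.txt is purely advisory: a well-behaved crawler parses it and
# checks each URL before fetching; nothing server-side enforces it.
rules = """
User-agent: *
Disallow: /entry/
Disallow: /profile/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/entry/signin"))   # False: matches a Disallow prefix
print(rp.can_fetch("*", "/discussions"))    # True: no rule matches
```

    A malicious crawler simply skips this check, which is why the file provides no security.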
  • Options
    @phpfreak, just add a robots.txt file to your root directory with "User-agent: *" and you should be fine. But you really don't benefit from Google crawling your 'signin', 'register', and 'admin' URLs, which is why you can block those URLs in robots.txt. For an example of how to do that, refer to any major site's robots.txt, e.g. http://stackoverflow.com/robots.txt

    Since there are only a handful of such unwanted pages, I would just add a robots.txt myself to the root folder of my installation.

    That said, Vanilla could do a lot of other things for great SEO, e.g. title generation, meta tag generation based on page content, tags, and category, canonical URLs to indicate duplicate content, etc.

    Thanks.
    http://nepali.im
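    A minimal file along the lines described above might look like this (the paths are illustrative assumptions; check them against your install's actual URLs):

```
User-agent: *
Disallow: /signin
Disallow: /register
Disallow: /admin
```

    Remember that this only asks well-behaved crawlers to stay away; it is not an access control.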


  • Options
    There's no security in a robots.txt file; in fact, the opposite is usually true. A malicious bot or crawler would ignore the file anyway. It does seem like Vanilla should have one, though, for the reasons mclovin has stated.
  • Options
    phreakphreak Vanilla*APP (White Label) & Vanilla*Skins Shop MVP
    @mclovin: thanks for your input.

    @lincoln & LinusIndigo: That's not what I meant. My point was that server-related files can get indexed by (various kinds of) search engines and could then be found with a simple search. So if a script kiddy tries a script against 50 pages from such a search and your page is among the results, that could become a security vulnerability.

    This can happen for various reasons, for example if you accidentally left a folder and its files set to CHMOD 777 because you forgot to change it back. The robots.txt would simply prevent files in that folder from being indexed.
  • Options
    achal New
    edited October 2012

    Hi all, I recently applied for Google AdSense, but the application failed with a message that my site was incomplete. I drilled down into the possible reasons and found that Google had crawled content-less pages like tags, profiles, and discussions, and concluded that my site had little or incomplete data relative to its number of pages.

    One solution is to add a robots.txt to keep those pages from being crawled, but I fear that if I block the discussions page, it will also block the rest of the forum content, since discussions behave like a directory structure.

    Can anyone help me formulate a robots.txt so that I don't end up blocking important content pages?

    For your reference, here is a link to the site in case it's necessary: http://myhealthfellow.com

  • Options
    achal New
    edited October 2012

    Hi, I searched around and made a custom robots.txt especially for a Vanilla forum. The following is the code:

    User-agent: *
    Crawl-delay: 5
    Disallow: /discussions/tagged/
    Disallow: /profile/
    Disallow: /entry/
    Disallow: /activity
    
    # for google adsense
    # As these pages don't have too much content we will not include them for checking for google adsense
    
    User-agent: Mediapartners-Google
    Disallow: /categories
    Disallow: /discussions/tagged/
    Disallow: /profile/
    Disallow: /entry/
    Disallow: /activity
    Disallow: /discussions/popular
    Disallow: /discussions/unanswered
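    Regarding the earlier worry about blocking the rest of the forum: Disallow rules match by URL-path prefix only, so `/discussions/tagged/` does not catch ordinary discussion pages. A quick check of the rules above with Python's stdlib parser (the example URLs are made up):

```python
from urllib.robotparser import RobotFileParser

# Disallow rules match by URL-path prefix, so blocking /discussions/tagged/
# does not block /discussions itself or individual discussion pages.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /discussions/tagged/
Disallow: /profile/
Disallow: /entry/
Disallow: /activity
""".splitlines())

print(rp.can_fetch("*", "/discussions/tagged/health"))    # blocked prefix
print(rp.can_fetch("*", "/discussions/1234/some-topic"))  # not a blocked prefix
```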
    
  • Options

    Please find the robots.txt file attached.

  • Options

    Yeah, there should absolutely be NO DEFAULT robots.txt. Why? Because:
    1) Anyone can create one themselves, and
    2) Everyone has different needs for search engines.

    In my own case, I didn't really care one way or the other until I discovered that Microsoft's Bing and Yahoo's crawlers were tearing my site apart in terms of page hits. I blocked Yahoo pretty easily; Bing took forever, because Microsoft's Bing crawlers sometimes seem to ignore robots.txt entirely (or it takes a long time for a specific server to acknowledge it). They suck and I want them to die in a lake.

    I want my stuff to show up in Google, so I don't block anything from Google's crawlers (and Google is so unobtrusive compared to Bing it's crazy).

    So I fear that people will go and take a generic robots.txt and block everything, rather than blocking just the problem engines...

  • Options
    ToddTodd Chief Product Officer Vanilla Staff

    This is a good discussion to have so that people understand the issues with search engines. If you look at the robots.txt Vanilla uses with its sitemaps plugin, we do the following:

    Sitemap: http://vanillaforums.org/sitemapindex.xml
    
    User-agent: *
    Disallow: /entry/
    Disallow: /messages/
    Disallow: /profile/comments/
    Disallow: /profile/discussions/
    Disallow: /search/
    

    We also put noindex/nofollow rel attributes and meta tags throughout the software to try to keep crawlers off certain pages that won't benefit your site.
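    For reference, the meta-tag form of that hint looks roughly like this (generic markup, not Vanilla's exact output):

```html
<!-- In the <head> of pages that shouldn't be indexed, e.g. /search/ -->
<meta name="robots" content="noindex, nofollow">

<!-- Per-link hint telling crawlers not to follow a particular link -->
<a href="/profile/comments/example" rel="nofollow">Example's comments</a>
```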

    The one page where we are still seeing problems is the recent discussions list. Big sites can see real slowdowns when crawlers hit pages in the thousands. Recently we've added a config parameter to limit the page count on recent discussions to a reasonable number like five pages. Crawlers can find the pages through the individual categories and they are much faster.

    Limiting the page lists is the direction we'll be going with the software. I just need to come up with a good UX for searching past that fifth page.

  • Options

    An alternative to blocking access to directories containing program-only content is to put the program files outside the directory from which HTML is served, although that might complicate installation a bit.
