
Robots.txt in the Vanilla root is needed!

phreak Vanilla*APP (White Label) & Vanilla*Skins Shop MVP
edited August 2010 in Vanilla 2.0 - 2.8
Hi all,

A robots.txt file is needed in the Vanilla root! It should ship with the download.
It's a must, I think. Any counterarguments?

What do you think the Vanilla robots.txt should contain?
  • VanillaAPP | iOS & Android App for Vanilla - White label app for Vanilla Forums OS
  • VanillaSkins | Plugins, Themes, Graphics and Custom Development for Vanilla

Comments

  • Options
    Todd Chief Product Officer Vanilla Staff
    I don't know off hand. Do you have a suggestion for the basics?
  • Options
    phreak Vanilla*APP (White Label) & Vanilla*Skins Shop MVP
    Hi Todd, sorry for the late reply.

    Hmm, the CMSs I use usually block out their system files. That is, they define a list of folders or files that are then "usually" not crawlable or indexable by search engines.

    This can come in handy if there is a known security issue that isn't fixed yet and script kiddies are searching for the related file to try out their latest milw0rm script.

    Um, but that's just what I've heard; I'm not really a specialist on the use of robots.txt (sorry if my opening statement suggested that ;) ). I don't know whether this is still a real concern.
  • Options
    Linc Detroit Admin
    edited September 2010
    @phreak You misunderstand how robots.txt works. It's merely a request not to crawl content, not a blocking mechanism. There is no security in it whatsoever. I don't see any need for a robots.txt file in the software.
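    To make that point concrete: a compliant crawler merely parses the file and consults it before fetching a URL; nothing on the server enforces the rules. A minimal sketch with Python's stdlib parser (the Disallow paths are just examples):

```python
from urllib.robotparser import RobotFileParser

# robots.txt is purely advisory: a well-behaved crawler parses it and
# checks each URL before fetching; nothing server-side enforces it.
rules = """
User-agent: *
Disallow: /entry/
Disallow: /profile/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "/entry/signin"))   # False: matches a Disallow prefix
print(rp.can_fetch("*", "/discussions"))    # True: no rule matches
```

    A malicious crawler simply skips this check, which is why the file provides no security.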
  • Options
    @phpfreak, just add a robots.txt file to your root directory with "User-agent: *" and you should be fine. But you really don't benefit from Google crawling your 'signin', 'register', and 'admin' URLs, which is why you can block those URLs in robots.txt. For an example of how to do that, refer to any major site's robots.txt, e.g. http://stackoverflow.com/robots.txt

    Since there are only a handful of such unwanted pages, I would just add a robots.txt myself to the root folder of my installation.

    That said, Vanilla could do a lot of other things for great SEO, e.g. title generation, meta tag generation based on page content, tags, and category, canonical URLs to indicate duplicate content, etc.

    Thanks.
    http://nepali.im
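    A minimal file along the lines described above might look like this (the paths are illustrative assumptions; check them against your install's actual URLs):

```
User-agent: *
Disallow: /signin
Disallow: /register
Disallow: /admin
```

    Remember that this only asks well-behaved crawlers to stay away; it is not an access control.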


  • Options
    There's no security in a robots.txt file; in fact, the opposite is usually true. A malicious bot or crawler would ignore the file anyway. It does seem like Vanilla should have one, though, for the reasons mclovin has stated.
  • Options
    phreakphreak Vanilla*APP (White Label) & Vanilla*Skins Shop MVP
    @mclovin: thanks for your input.

    @lincoln & LinusIndigo: That's not what I meant. My point was that server-related files can get indexed by (various kinds of) search engines and could then be found with a simple search. So if a script kiddy tries a script against 50 pages from such a search and your page is among the results, that could become a security vulnerability.

    This can happen for various reasons, for example if you accidentally left a folder and its files set to CHMOD 777 because you forgot to change it back. The robots.txt would simply prevent files in that folder from being indexed.
  • Options
    achal New
    edited October 2012

    Hi all, I recently applied for Google AdSense, but the application failed with a message that my site was incomplete. I drilled down into the possible reasons and found that Google had crawled content-less pages like tags, profiles, and discussions, and concluded that my site had little or incomplete data relative to its number of pages.

    One solution is to add a robots.txt to keep those pages from being crawled, but I fear that if I block the discussions page, it will also block the rest of the forum content, since discussions behave like a directory structure.

    Can anyone help me formulate a robots.txt so that I don't end up blocking important content pages?

    For your reference, here is a link to the site in case it's necessary: http://myhealthfellow.com

  • Options
    achal New
    edited October 2012

    Hi, I searched around and made a custom robots.txt especially for a Vanilla forum. The following is the code:

    User-agent: *
    Crawl-delay: 5
    Disallow: /discussions/tagged/
    Disallow: /profile/
    Disallow: /entry/
    Disallow: /activity
    
    # for google adsense
    # As these pages don't have too much content we will not include them for checking for google adsense
    
    User-agent: Mediapartners-Google
    Disallow: /categories
    Disallow: /discussions/tagged/
    Disallow: /profile/
    Disallow: /entry/
    Disallow: /activity
    Disallow: /discussions/popular
    Disallow: /discussions/unanswered
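    Regarding the earlier worry about blocking the rest of the forum: Disallow rules match by URL-path prefix only, so `/discussions/tagged/` does not catch ordinary discussion pages. A quick check of the rules above with Python's stdlib parser (the example URLs are made up):

```python
from urllib.robotparser import RobotFileParser

# Disallow rules match by URL-path prefix, so blocking /discussions/tagged/
# does not block /discussions itself or individual discussion pages.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /discussions/tagged/
Disallow: /profile/
Disallow: /entry/
Disallow: /activity
""".splitlines())

print(rp.can_fetch("*", "/discussions/tagged/health"))    # blocked prefix
print(rp.can_fetch("*", "/discussions/1234/some-topic"))  # not a blocked prefix
```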
    
  • Options

    Please find the robots.txt file attached.

  • Options

    Yeah, there should absolutely be NO DEFAULT robots.txt. Why? Because:
    1) Anyone can create one themselves, and
    2) Everyone has different needs for search engines.

    In my own case, I didn't really care one way or the other until I discovered that Microsoft's Bing and Yahoo's crawlers were tearing my site apart in terms of page hits. I blocked Yahoo pretty easily; Bing took forever, because Microsoft's Bing crawlers sometimes seem to ignore robots.txt entirely (or it takes a long time for a specific server to acknowledge it). They suck and I want them to die in a lake.

    I want my stuff to show up in Google, so I don't block anything from Google's crawlers (and Google is so unobtrusive compared to Bing it's crazy).

    So I fear that people will go and take a generic robots.txt and block everything, rather than blocking just the problem engines...

  • Options
    ToddTodd Chief Product Officer Vanilla Staff

    This is a good discussion to have so that people understand the issues with search engines. If you look at the robots.txt Vanilla uses with its sitemaps plugin, we do the following:

    Sitemap: http://vanillaforums.org/sitemapindex.xml
    
    User-agent: *
    Disallow: /entry/
    Disallow: /messages/
    Disallow: /profile/comments/
    Disallow: /profile/discussions/
    Disallow: /search/
    

    We also put noindex/nofollow rel attributes and meta tags throughout the software to try to keep crawlers off certain pages that won't benefit your site.
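    For reference, the meta-tag form of that hint looks roughly like this (generic markup, not Vanilla's exact output):

```html
<!-- In the <head> of pages that shouldn't be indexed, e.g. /search/ -->
<meta name="robots" content="noindex, nofollow">

<!-- Per-link hint telling crawlers not to follow a particular link -->
<a href="/profile/comments/example" rel="nofollow">Example's comments</a>
```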

    The one page where we are still seeing problems is the recent discussions list. Big sites can see real slowdowns when crawlers hit pages in the thousands. Recently we've added a config parameter to limit the page count on recent discussions to a reasonable number like five pages. Crawlers can find the pages through the individual categories and they are much faster.

    Limiting the page lists is the direction we'll be going with the software. I just need to come up with a good UX for searching past that fifth page.

  • Options

    An alternative to blocking access to directories containing program-only content is to put the program files outside the directory from which HTML is served, although that might complicate installation a bit.
