
Scaling, load balancing

edited January 2008 in Vanilla 1.0 Help
Evaluating forum packages has been fairly depressing - until I stumbled upon Vanilla. Finally something that isn't yet another clone of every other ugly forum out there (besides bbPress - but Vanilla's better). So I downloaded it - installed it in minutes ... and I love it. But I've got questions. The big one for me is ... will this baby scale? I'm sure I'm not the only one who hopes the site I create will become popular. But it surprises me there aren't more discussions about this topic.

I've got some more directed questions than other posts:

-Searches. I don't see any kind of keyword indexing, nor any use of MySQL's "fulltext" feature (built-in keyword indexes for full-text searching). Keyword indexing scares me - it's complicated and bulky - but it's fast. I believe phpBB has dedicated tables for it. So what gives? I find searching this forum pretty speedy. Does anyone have a good reason why "fulltext" should not be used? Are there issues with hyphenated strings, or with multilingual content? Are there any plans for some kind of keyword indexing in the future?

-Caching. There isn't any. Caching HTML and database queries are the natural choices. Even with my very limited knowledge of the app, I see problems with both, given that Vanilla makes a lot of use of recent data. But, for instance, couldn't the main "discussions" page be cached per user, with invalidation performed both per user (e.g. when a user reads a discussion and the highlight changes) and globally (e.g. when a new post is made)? Load balancing this would be the tricky part - see next question. Anyway - are there plans to create a caching mechanism?

-Load balancing. This is most important. Replicating or clustering the database can be done - that's not the issue. Having multiple web servers is. I see that all data is stored in a database - except images (which can be centralized via NFS anyway - provided file locking issues aren't a problem). Is that all true? If so - it just comes down to session handling:

I see Vanilla uses PHP's standard built-in session handling ($_SESSION). By default this creates a PHPSESSID cookie holding a key, which maps to a file holding the serialized session variables. While this is fine for most, I have a problem with it: these "server-side sessions" (meaning that session data is stored on the server) don't work across more than one server. Load balancing Vanilla would involve forcing sessions to go back to the machine that created them. While load balancers (such as "Pen" and "Pound") are capable of doing this, it really doesn't lend itself to "load balancing". Over time, you find one machine is overloaded with users who never sign off, while others are sitting around doing nothing.

I'm no expert - but I see Zend provides a package with cross-session support. Prob real expensive - and besides - it's a bandaid approach. Beyond that - I don't see a way of achieving this without creating custom session handling. I see a lot of custom stuff in Vanilla - and I think smart session handling would be a great option.

So an alternative to "server-side sessions" is "client-side sessions". Rather than have a cookie holding an ID (which maps to your session data), you put the session data in the cookie itself. That way, HTTP traffic can simply be round-robined to any web server, as the cookie provides all the info about the user. And what info is that? All I can see in Vanilla's session info is a UserID (and a blank Password?? - which I'm sure isn't needed). Of course, security is more of an issue with client-side sessions, but that can be solved in many ways. I have developed this technique for a large website and it works really well (and is just as secure). The hard part with client-side sessions is logging someone out (completely) to limit session hijacking. I solved this, and I can explain it if anyone is interested. But I'm pretty sure Vanilla could solve this more easily .. though hey .. enough detail. I wrote too much an hour ago..
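
To make the client-side idea concrete, here is a minimal sketch in Python (not PHP, and not the technique used on the site mentioned above, which isn't described here). It assumes every web server shares one secret key and signs the cookie with an HMAC so the UserID can't be forged. The names SECRET_KEY, make_cookie and read_cookie are made up for illustration, and the sketch does not address the "log out completely" problem.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"change-me"  # hypothetical shared secret, identical on every web server


def make_cookie(user_id, ttl_seconds=3600, now=None):
    """Pack the user id and an expiry time into a signed cookie value."""
    now = time.time() if now is None else now
    payload = json.dumps({"uid": user_id, "exp": int(now) + ttl_seconds})
    body = base64.urlsafe_b64encode(payload.encode("utf-8"))
    sig = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest()
    return body.decode("ascii") + "." + sig


def read_cookie(cookie, now=None):
    """Return the user id if the cookie is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        body, sig = cookie.rsplit(".", 1)
    except ValueError:
        return None  # no signature part at all
    expected = hmac.new(SECRET_KEY, body.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison; the payload is only parsed after the signature checks out.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    data = json.loads(base64.urlsafe_b64decode(body))
    if data["exp"] < now:
        return None
    return data["uid"]
```

Any server holding the same key can validate the cookie without touching a database or a local session file, which is what lets requests be round-robined freely.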

Anyway - I'd like to know if anyone has looked into doing this - and what might be involved. I think Vanilla could use an 'alternate client-side session manager' for those wishing to scale. Perhaps it could be an option - I think the feature would be welcomed by many wishing to grow. From my quick poking - I see the People.Class.Session class could be replaced - while needing some modification to People.Class.Authenticator.

Is it that simple?

Comments

  • ADM
    edited February 2007
    Some good questions there. To answer your question about caching, it's currently not possible due to the fact that whispers are present on the main discussion pages and as such they differ between users.

    Though I guess Vanilla could cache things like total post counts for each topic so it doesn't have to query that as often. I know that vBulletin does that (well at least has the option to do that).
  • I asked Mark already, when joining, about the fulltext point. His answer is probably in a search result.
  • edited February 2007
    !!! I don't know anything about load balancing.

    Vanilla uses PHP's built-in session manager. By default PHP saves the session data in local files, but you can set PHP to save it in a database.
  • @sukibabee
    Hi, I reply publicly because it may be useful for others.
    I apologize for my imprecise statement, the relevant comment is on VanillaDev.
  • edited February 2007
    Thank you all for your answers.

    It's a shame about caching. After looking into the Vanilla code more, I see it would be really difficult to get right - especially given the extensible nature of the product. Plus it's not really aimed at large forums. Fortunately, throwing more hardware at the problem suffices.

    Dinoboff, I think you are talking about session_set_save_handler(), which registers user-defined session storage routines, right? Thanks for the pointer. You're right that, provided the callbacks are established before session_start(), I can easily override how PHP stores the session data and redirect storage to (for instance) a database. PHP itself still creates the Session ID (with an extremely low probability that it will conflict with one from another server) - but you can simply ignore it. You can use custom cookies and implement everything yourself.

    It may be a shock to Vanilla users ... but Databases are slow man. You wanna avoid them - esp on high traffic sites where load-balancing is required. In my case, I'd like to integrate Vanilla on a high-traffic site that already has an efficient client-side session mechanism. Retrieving session variables from the Vanilla database on every page (to enable a user access to other parts of the site) isn't acceptable. Most discussions here relate to molding a site to Vanilla, whereas I'd like to do the opposite.

    So - for others wishing to do the same, check out session_set_save_handler(). I haven't tried, but I think it's the cleanest way to alter Vanilla's session handling.
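
    For readers who want to see the shape of such a handler: PHP's session_set_save_handler() takes callbacks for opening, closing, reading, writing, destroying and garbage-collecting sessions. The sketch below is a Python illustration of the core of that contract (not PHP, and not Vanilla code), backed by SQLite purely so the example runs on its own. The class name DbSessionStore and the table layout are assumptions.

```python
import sqlite3
import time


class DbSessionStore:
    """read/write/destroy/gc callbacks backed by a database all web servers can reach."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions ("
            "id TEXT PRIMARY KEY, data TEXT NOT NULL, updated REAL NOT NULL)"
        )

    def read(self, session_id):
        """Return the serialized session data, or an empty string if none exists."""
        row = self.conn.execute(
            "SELECT data FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return row[0] if row else ""

    def write(self, session_id, data, now=None):
        """Store the serialized session data and refresh its timestamp."""
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO sessions (id, data, updated) VALUES (?, ?, ?)",
            (session_id, data, now),
        )
        self.conn.commit()

    def destroy(self, session_id):
        """Remove one session, e.g. on logout."""
        self.conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))
        self.conn.commit()

    def gc(self, max_lifetime, now=None):
        """Delete sessions not written to for more than max_lifetime seconds."""
        now = time.time() if now is None else now
        self.conn.execute(
            "DELETE FROM sessions WHERE updated < ?", (now - max_lifetime,)
        )
        self.conn.commit()
```

    Because every server reads and writes the same table, a request can land on any machine. The cost is the per-request database hit complained about above.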

    I see a quick'n'dirty way also: use Vanilla's "PersistentSession" (remember me) logic, and assign a different session name (the default is "PHPSESSID") on each server. A persistent session is stored in the database, but only retrieved once by each web server (which creates its own session, with a unique session cookie name). The "remember me" would have to be forced on, and the logout code needs to clear all the web servers' session cookies .. and session data can't change. But - it's quick.
  • Why couldn't session data change? Couldn't it just be sync'd across the servers when it did change? I mean, it shouldn't change too often, should it?
  • Hey Stash - yes, you're right. You'd just have to delete all the session cookies and the database would be read again using the same "PersistentSession" mechanism. Damn you guys are smart. :)
  • @sukibabee: in your quest for scalable forums, are there other candidates you are considering/like better? @ADM: you've just given me another reason to hate/disable whispers... by the way, I see little reason why (without whispers) it would be impossible to cache certain pages (even if it were just for 30 seconds). I once ran a pretty large forum that 'peaked' between 5 and 8 pm daily. We found that even short query caching expiry times (e.g. 5 seconds) made a HUGE difference in resource requirements during peak traffic... Further optimization, incl. 'page-chunk' caching and user level differentiation (e.g. guests were read-only, with no posting privileges, so it made very little difference to them whether they were looking at a cached page that was 5 mins old or a 'fresh' page that was 5 seconds old), allowed us to handle 20x more concurrent users.
  • @tomtester : I have a few suggestions for you.

    I found apples and oranges in the forum world. Let's call them "bad apples" and oranges. The bad apples all look alike - I find them unacceptable from a user-interface perspective (and not easily corrected). These are written by engineers, so a lot of them are built to scale (phpBB for instance). For the oranges, I found only two: Vanilla and bbPress. I stopped looking when I found Vanilla, so there may be more. So suggestion 1 is: look for more here - http://en.wikipedia.org/wiki/Comparison_of_Internet_forum_software

    Some form of caching is built into bbPress (I'm not sure how much) - but from the little I know, the caching is broken right now. bbPress doesn't index keywords, but does use the "fulltext" feature of MySQL. It is very likely that bbPress will scale much better than Vanilla. But the main problem with bbPress is that it is still being developed. It isn't as slick as Vanilla, but that may also change down the road. So suggestion 2: check bbPress out.

    Another is to build caching in Vanilla. I'm real new here, so take what I say with a grain of salt (everything below is 'my opinion', not fact) :

    While I respect and consider what others say, I actually think building a caching mechanism in Vanilla is possible - just pretty darn hard. I'll stick to discussing database query caching, since that interests you. So, there's smart caching and dumb caching :

    Dumb Caching : As you point out, building a "dumb cache" is dirty but very effective. It involves caching data for a page and blindly using it for a set period of time (you used 5 seconds). It's been said that it can't be done because of "whispering". I think it just changes the implementation of the cache. Take the main discussions page. When Bob is signed in, he sees all the discussions like most people see them, but popular Sally (who gets a lot of whispers) sees a different list. To do effective caching, you need to cache per user. Imagine the cache is a directory structure: the first directory level is the user_id, with that user's cached data held under it. So when Bob accesses the main page, a cache file is generated and put into his cache directory. When he refreshes the page, his cache file is fed back to him. Same for Sally. You could argue that dumb caching per user doesn't help, but visitors will count as one user - so all visitors share the same cache file. Still - it's pretty weak. If you turn off whispering, maybe you don't need per-user caching (but I still think you do - see my note below about this).
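
    A rough sketch of that per-user dumb cache, in Python rather than PHP and using an in-memory dict where the text describes directories and files. DumbCache, GUEST_KEY and the build callback are names invented for the example, and the 5-second default is just the figure quoted above.

```python
import time

GUEST_KEY = "guest"  # every signed-out visitor shares this one cache slot


class DumbCache:
    """Serve cached data blindly until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds=5.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # (user key, page name) -> (stored_at, data)

    def fetch(self, user_id, page, build):
        """Return cached data for this user and page, calling build() when stale."""
        key = (GUEST_KEY if user_id is None else user_id, page)
        hit = self._store.get(key)
        now = self.clock()
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        data = build()  # the expensive database work happens only here
        self._store[key] = (now, data)
        return data
```

    Passing user_id=None for guests is what makes all visitors share one entry, which is where most of the saving would come from.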

    Smart Caching : This is where you cache data and use it until the data changes. This is done by purging the cache file (aka invalidating the cache). The trick is to detect when the cache should be invalidated. If whispering is turned on, and caching is required per user, invalidating everyone's cache files becomes a challenge. It's not acceptable to go remove a thousand cache files (one per user). So a global cache directory should be made, and timestamps compared as part of cache fetching. Per user caching would be very effective in this case. But if whispering is turned off, per user caching *may* not be required (but I still think so) - and would greatly simplify the implementation.
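
    One way to read the "compare timestamps at fetch time" idea, sketched in Python with an in-memory dict standing in for the cache directory. This is my interpretation and not existing Vanilla code: SmartCache and its method names are hypothetical, and a logical counter is used in place of file modification times so that ties can't occur.

```python
import itertools


class SmartCache:
    """Per-user cache entries invalidated by one global 'last changed' stamp."""

    def __init__(self):
        self._tick = itertools.count(1)  # logical clock shared by all stamps
        self._global_changed = 0         # bumped on e.g. a new post
        self._user_changed = {}          # bumped on e.g. a user reading a discussion
        self._store = {}                 # (user_id, page) -> (stamp, data)

    def invalidate_all(self):
        """Mark every user's entries stale without touching any of them."""
        self._global_changed = next(self._tick)

    def invalidate_user(self, user_id):
        """Mark one user's entries stale."""
        self._user_changed[user_id] = next(self._tick)

    def fetch(self, user_id, page, build):
        """Reuse an entry only if it was stored after the newest relevant change."""
        key = (user_id, page)
        hit = self._store.get(key)
        newest_change = max(self._global_changed, self._user_changed.get(user_id, 0))
        if hit is not None and hit[0] > newest_change:
            return hit[1]
        data = build()
        self._store[key] = (next(self._tick), data)
        return data
```

    A new post costs a single counter update instead of deleting a thousand cache files, and each stale entry is rebuilt lazily the next time its owner asks for it.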

    Up till now, I've been talking about caching data for 'pages' - when in fact the code gets data in terms of a database query. For dumb or smart caching, the cache has to be keyed by the inputs to the database query. Fortunately, the code is already geared for this. Database queries are constructed by calls to the SqlBuilder class. The Select() and GetRow() calls of the Database class are the points to implement the caching mechanism. The cache filename is a construction of all the inputs to the database query. This all sounds easy until you look at the vast number of queries that take place in Vanilla. And some of these inputs may be things like the current time or others that constantly change. So it has to be somewhat smart. It's kinda scary. Cache invalidation would have to happen in the same place - calls in the database to update the data. I think only Mark could write the algorithm to match select inputs with update inputs to do cache invalidation. It looks way too complicated. I doubt he would be interested to do this, it looks really hard.
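
    To illustrate "the cache filename is a construction of all the inputs to the database query", here is a small Python sketch. It assumes the query is available as SQL text plus a dict of parameters, which is not how SqlBuilder hands things over; cache_key is a made-up helper.

```python
import hashlib
import json


def cache_key(sql, params):
    """Derive a stable cache filename from the query text and its inputs."""
    normalized_sql = " ".join(sql.split())  # ignore differences in whitespace only
    # sort_keys makes the key independent of dict ordering; default=str covers
    # values such as dates that json cannot serialize directly.
    blob = json.dumps([normalized_sql, params], sort_keys=True, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest() + ".cache"
```

    This also shows the problem with volatile inputs: if a parameter such as the current time is part of params, every call produces a new key and nothing is ever reused.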

    The other option is to cache data at certain places outside the Database class. But this requires invalidation to be done at certain places too. Since it is not a generic cache, it'll likely break and be difficult to maintain.

    So, the note about whispering : The other thing I see that is unique per user is the highlighting showing which entries have been read and which are new. This is done via the LUM_UserDiscussionWatch table. When a user views a discussion, a timestamp entry in this table is updated. So the main discussions page will still be a unique database fetch per user, even without whispering activated. So I think that feature makes per-user caching mandatory also.

    So my suggestion 3 : don't even try to implement a cache in Vanilla. In writing this up, I've convinced myself that it's just too hard man.

    Suggestion 4 : search for "FARM_DATABASE_HOST" - I don't know much about it, but it looks like Vanilla somehow has built-in support for a database farm. Combine that with web-server load-balancing, and your forum scales - though at a cost.
  • I wish there was an easy way to invalidate a browser-side cache. I do believe I found an easy way to smart cache for guests... it could be expanded to work for some users. My shared hosting is atrocious in database retrieval times. Maybe I'll give it a go.
  • edited February 2007
    WallPhone - can't wait to hear what you come up with. I thought of something else: it seems the two features of Vanilla that hold back (global - not per user) caching are whispering and the visual cues that show whether or not you've read a discussion. If you disable both of these features, there's nothing to stop caching the data for the discussions page for both guests and users. Am I right? Everyone knows how to disable whispering. The quick way to disable the "have I read this" cue is to change the style sheet. Your users will learn to use the forum without that feature. In this case, caching shouldn't be so hard. It would be better to invalidate the cache than to rely on a short timeout. You just need to identify the points in the code where invalidation needs to take place ... e.g. when a comment is added, a sink is done... etc. Grep for Update - it can't be too hard, but there are probably a lot of these. The same technique may work for the data associated with viewing each discussion - but I haven't looked into this. It may involve user-specific database queries (even with whispering disabled?). Not sure. With the visual cues disabled, it would also be beneficial to disable the database update to the LUM_UserDiscussionWatch table that takes place every time a user opens a discussion. In a replicated database environment, write rates are the limiting factor. I'm guessing this would help a lot.
  • @Suki

    I'm quite familiar with various cache levels/methods.

    'REAL-TIME'
    Experience has taught me that REAL-TIME data is rarely required in a forum setting (it's not
    instant messaging). Just like a CD-Player can give a good representation of an analog signal,
    'dumb-cached' pages/page chunks can provide a 'representation' of a forum that works 'good
    enough' for most users.

    STUPID is SMART
    Since I'm pretty sure nobody can anticipate ALL exceptions, not even if you're Mark, and
    complexity ALWAYS increases the # of bugs, I'm a big proponent of simple (80%) solutions
    and thus 'dumb caching' pages or page elements is the way to go for me.

    SCALE & SACRIFICE
    I'd be more than happy to sacrifice some of the 'NICE BUT NOT REQUIRED' features for speed &
    performance...

    I'm sure even the exact features that PREVENT caching could be replaced by *similar* ones that
    do not require a user-specific database lookup (e.g. local cookies) or can be SPLIT into generic
    and user-specific queries (where the former could be cached).

    Finally, whispers could perhaps be re-coded, e.g. retrieved as a separate, non-cached query and
    then merged on the server, or even off-loaded to the user's machine via JavaScript for page
    creation, interleaving 'regular' comments and whispers?

    Just a thought...
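
    The server-side merge could look something like this Python sketch. It assumes both the cached public comments and the freshly queried whispers arrive as lists of (posted_at, author, text) tuples already sorted by time; merge_comments and that tuple layout are invented for the example.

```python
import heapq
from operator import itemgetter


def merge_comments(cached_public, whispers_for_user):
    """Interleave two time-sorted comment lists into one time-sorted list."""
    # heapq.merge walks both inputs once and never re-sorts the cached part.
    return list(heapq.merge(cached_public, whispers_for_user, key=itemgetter(0)))
```

    The public list can then come from a shared cache, with only the (usually short) whisper list fetched per user.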
  • "Some good questions there. To answer your question about caching, it's currently not possible due to the fact that whispers are present on the main discussion pages and as such they differ between users." Present whispers as a new tab/page?
  • TomTester, I like the idea of the whispers perhaps being separated and then merged in, if that's possible.
  • edited February 2007
    silly comment - self-censorship activated
  • TomTester
    edited February 2007
    @Stash: of course it should be possible... anything is possible... this is the U.S. of A! ;-)
    I should really stop proposing things and look at the code myself.

    @Toivo: the 'strength' of whispers however is the fact that they appear in-line with the
    other comments. A separate TAB makes them into private messages (not bad,
    but also not the same).
  • Actually, this is the UK, the Ukraine, Australia, Germany... but hey ;)
  • hahah
  • Toivo
    edited February 2007
    aah, but the strength of whispers also comes from the fact that these look like ordinary comments and you post them just like ordinary comments (same look, same logic, as clean as)?
  • Exactly, they allow the private sub-conversation to flow along with the main one.