
"Undefined foreign content" when fetching wordpress post

ludalex New
edited May 2012 in Vanilla 2.0 - 2.8

I set up Vanilla 2.0.18.4 (w/ Vanilla plugin) and WordPress (w/ Vanilla Forums). All updated to the latest version.

I correctly connected the Vanilla install through the WP plugin and chose the forum category where the WP posts end up.

When someone comments, it does create a new post in the chosen category, but the content is just a broken link to the post and the title is "Undefined foreign content".

What should I do?

Best Answer

  • x00 MVP
    edited May 2012 Answer ✓

    If Vanilla cannot access the site to scrape the title, it will not be able to find it. The facility uses FetchPageInfo, which uses ProxyRequest, which requires cURL. FetchPageInfo requires DOMDocument: it first searches for the title element and the meta description; if it doesn't find a description, it looks for the first p element with content longer than 90 characters and chops it at 400 characters.

    I'm not really sure why they implemented it this way. Personally, I would have used the excellent JSON API that is available as a WordPress plugin.
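The lookup order described above can be sketched as a standalone snippet (a simplified illustration, not Vanilla's actual function; the `$PageHtml` sample stands in for whatever cURL fetched):

```php
<?php
// Sketch of the FetchPageInfo fallback order: <title>, then
// <meta name="description">, then the first <p> longer than 90
// characters, chopped at 400. $PageHtml stands in for the cURL result.
$PageHtml = '<html><head><title>Example</title></head>'
          . '<body><p>short</p></body></html>';

$Dom = new DOMDocument();
@$Dom->loadHTML($PageHtml);

// 1. <title> element
$TitleNodes = $Dom->getElementsByTagName('title');
$Title = $TitleNodes->length > 0 ? $TitleNodes->item(0)->nodeValue : '';

// 2. <meta name="description">
$Description = '';
foreach ($Dom->getElementsByTagName('meta') as $Meta) {
    if (strtolower($Meta->getAttribute('name')) == 'description') {
        $Description = $Meta->getAttribute('content');
    }
}

// 3. First <p> with more than 90 characters of content
if ($Description == '') {
    foreach ($Dom->getElementsByTagName('p') as $P) {
        if (strlen($P->nodeValue) > 90) {
            $Description = $P->nodeValue;
            break;
        }
    }
}

// 4. Chop the description at 400 characters
if (strlen($Description) > 400) {
    $Description = substr($Description, 0, 400);
}

echo $Title . "\n";
```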

    grep is your friend.

Answers

  • ludalex New
    edited May 2012

    Vanilla 2.0.18.4 (w/ Vanilla plugin) - I meant w/ < embed > Vanilla plugin.

  • x00 said:
    If Vanilla cannot access the site to scrape the title, it will not be able to find it. The facility uses FetchPageInfo, which uses ProxyRequest, which requires cURL. FetchPageInfo requires DOMDocument: it first searches for the title element and the meta description; if it doesn't find a description, it looks for the first p element with content longer than 90 characters and chops it at 400 characters.

    I'm not really sure why they implemented it this way. Personally, I would have used the excellent JSON API that is available as a WordPress plugin.

    well then what's the problem?

  • I don't know. I'm pointing you in the right direction, but I'm not going to investigate further.

    grep is your friend.

  • A simple gotcha is a private site. Scraping needs to be done on public content; if not, this solution is not suitable.

    grep is your friend.

  • ludalex New
    edited May 2012

    x00 said:
    A simple gotcha is a private site. Scraping needs to be done on public content; if not, this solution is not suitable.

    I've put the blog offline for non-administrators with a plugin; could that be the problem?

  • Likely. Basically, if it can't access the page publicly via cURL, it can't scrape the title.

    This solution only works with public content.

    grep is your friend.

  • Nope, tried disabling it and commenting on a post; I still get an "Undefined foreign content" post from the "System" user with a broken link to the WP post.

  • Well, I would work through the dependencies.

    grep is your friend.

  • x00 said:
    well I would work through the dependencies.

    what do you mean?

  • I said I could only take this so far; that is my lot. I mentioned the dependencies of this system above. If you don't know them, you need to find someone who can work through them for you. It is not something that can just be sorted out in the discussion.

    grep is your friend.

  • I'm surprised to be the only one having this issue. I performed various Google searches and apparently no one else has had the same problem.

  • Mark Vanilla Staff

    You get "Undefined foreign content" when Vanilla fails to retrieve the page in question. So, either your page is unavailable to unauthenticated users (i.e. in draft mode), or cURL is not set up or working properly.
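A quick way to rule out the second cause (a diagnostic sketch; the two extension names are the dependencies x00 listed above, cURL for ProxyRequest and DOM for DOMDocument):

```php
<?php
// Check the two PHP extensions Vanilla's scraper depends on:
// curl (used by ProxyRequest) and dom (used for DOMDocument parsing).
foreach (array('curl', 'dom') as $Ext) {
    echo $Ext . ': ' . (extension_loaded($Ext) ? 'loaded' : 'MISSING') . "\n";
}
```

Also confirm the post URL returns the actual page, not a login redirect, when fetched while logged out.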

  • Mark said:
    You get "Undefined foreign content" when Vanilla fails to retrieve the page in question. So, either your page is unavailable to unauthenticated users (i.e. in draft mode), or cURL is not set up or working properly.

    This is the curl configuration of the server I'm using.

  • Shadowdare said:
    @futuretalk and @ludalex:

    The Vanilla WP plugin should do it automatically; if not, all you have to do is modify the comment links in your template so that they look similar to this:

    <a href="http://yourdomain.com/path/to/page/with/comments/#vanilla_comments" vanilla-identifier="embed-test">Comments</a>

    Using this advice, it now works.
    But why does it fetch BLOG_NAME » BLOG_POST as the title, and the same thing with a thumbnail of the blog's logo and a meta link as the content? Shouldn't it fetch the post name and content?

  • x00 MVP
    edited May 2012

    That was their design. I'm not a fan of it, but it fetches whatever the title element is. So what you could do is make sure the title of the post is exactly that.

    You can also override the FetchPageInfo function: create conf/bootstrap.before.php and put

    <?php if (!defined('APPLICATION')) exit();
    
    if (!function_exists('FetchPageInfo')) {
       /**
        * Examines the page at $Url for title, description & images. Be sure to check the resultant array for any Exceptions that occurred while retrieving the page. 
        * @param string $Url The url to examine.
        * @param integer $Timeout How long to allow for this request. Default Garden.SocketTimeout or 1, 0 to never timeout. Default is 0.
        * @return array an array containing Url, Title, Description, Images (array) and Exception (if there were problems retrieving the page).
        */
       function FetchPageInfo($Url, $Timeout = 0) {
          $PageInfo = array(
             'Url' => $Url,
             'Title' => '',
             'Description' => '',
             'Images' => array(),
             'Exception' => FALSE
          );
          try {
             $PageHtml = ProxyRequest($Url, $Timeout, TRUE);
             $Dom = new DOMDocument();
             @$Dom->loadHTML($PageHtml);
             // Page Title
             $TitleNodes = $Dom->getElementsByTagName('title');
             $PageInfo['Title'] = $TitleNodes->length > 0 ? $TitleNodes->item(0)->nodeValue : '';
    
             /*
              * Do some string manipulation here, e.g. strip a "BLOG_NAME » " prefix.
              * Note that stripos() takes the haystack as its first argument:
              *
              * $Pos = stripos($PageInfo['Title'], '» ');
              * if ($Pos !== false)
              *    $PageInfo['Title'] = substr($PageInfo['Title'], $Pos + strlen('» '));
              */
     
             // Page Description
             $MetaNodes = $Dom->getElementsByTagName('meta');
             foreach($MetaNodes as $MetaNode) {
                if (strtolower($MetaNode->getAttribute('name')) == 'description')
                   $PageInfo['Description'] = $MetaNode->getAttribute('content');
             }
             // Keep looking for page description?
             if ($PageInfo['Description'] == '') {
                $PNodes = $Dom->getElementsByTagName('p');
                foreach($PNodes as $PNode) {
                   $PVal = $PNode->nodeValue;
                   if (strlen($PVal) > 90) {
                      $PageInfo['Description'] = $PVal;
                      break;
                   }
                }
             }
             if (strlen($PageInfo['Description']) > 400)
                $PageInfo['Description'] = SliceString($PageInfo['Description'], 400);
    
             // Page Images (collect all sources; the first 10 are sized and sorted below)
             $Images = array();
             $ImageNodes = $Dom->getElementsByTagName('img');
             $i = 0;
             foreach ($ImageNodes as $ImageNode) {
                $Images[] = AbsoluteSource($ImageNode->getAttribute('src'), $Url);
             }
    
             // Sort by size, biggest one first
             $ImageSort = array();
             // Only look at first 10 images (speed!)
             $i = 0;
             foreach ($Images as $Image) {
                $i++;
                if ($i > 10)
                   break;
    
                list($Width, $Height, $Type, $Attributes) = getimagesize($Image);
                $Diag = (int)floor(sqrt(($Width*$Width) + ($Height*$Height)));
                if (!array_key_exists($Diag, $ImageSort))
                   $ImageSort[$Diag] = $Image;
             }
             krsort($ImageSort);
             $PageInfo['Images'] = array_values($ImageSort);
          } catch (Exception $ex) {
             $PageInfo['Exception'] = $ex;
          }
          return $PageInfo;
       }
    }
    ?>
    

    Do string manipulation as appropriate.
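For the BLOG_NAME » BLOG_POST case specifically, the manipulation might look like this (a sketch; the "» " separator is an assumption about what your theme puts in the title element):

```php
<?php
// Strip a leading "BLOG_NAME » " prefix from a scraped title.
// The "» " separator is whatever your WP theme uses in <title>.
$Title = 'My Blog » Hello World';
$Pos = stripos($Title, '» ');
if ($Pos !== false) {
    $Title = substr($Title, $Pos + strlen('» '));
}
echo $Title . "\n";
```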

    grep is your friend.
