
→ ‘extractcontent’


This module extracts the main text from a web page (HTML content). It automatically picks out the sub-blocks of the HTML that are most likely to be body text, excluding menus, comments, navigation, affiliate links, and so on.


sudo gem install extractcontent

The basics

- Separates the HTML into blocks, calculates a score for each block, and ignores low-scoring blocks.
- Calculates the score from block arrangement, text length, and whether the block contains affiliate links or characteristic keywords.
- Clusters consecutive high-scoring blocks and compares the clusters with one another.
- If the page includes “Google AdSense Section Target” markers, gives them special attention.
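
The steps above can be illustrated with a minimal sketch. This is not the gem's actual implementation: the tag list, the penalty keywords, the threshold, the scoring formula, and all method names are assumptions chosen for illustration. The per-block decay is only one possible reading of "block arrangement". The AdSense markers are the standard `google_ad_section_start` / `google_ad_section_end` HTML comments.

```ruby
# Illustrative sketch only: this is NOT the extractcontent gem's source.
# Constants, thresholds and method names below are hypothetical.

# Tags treated as block boundaries (an assumed list).
BLOCK_SEPARATOR = %r{</?(?:div|table|tr|td|p|ul|ol|li|h[1-6])\b[^>]*>|<br\s*/?>}i
# Words that suggest boilerplate rather than body text (an assumed list).
PENALTY_WORDS = /copyright|all rights reserved|sitemap|log ?in/i
# Google AdSense section targeting markers.
AD_SECTION = /<!--\s*google_ad_section_start\s*-->(.*?)<!--\s*google_ad_section_end\s*-->/m

# Remove all tags and collapse whitespace.
def strip_tags(html)
  html.gsub(/<[^>]*>/, '').gsub(/\s+/, ' ').strip
end

# Score one block: text length counts for it, link text counts against it,
# and boilerplate keywords halve a positive score.
def block_score(block_html)
  text = strip_tags(block_html)
  link_text = block_html.scan(%r{<a\b[^>]*>(.*?)</a>}im).flatten
                        .map { |t| strip_tags(t) }.join
  score = text.length - 2 * link_text.length
  score *= 0.5 if score > 0 && text =~ PENALTY_WORDS
  score
end

# Split into blocks, score them, cluster consecutive high-scoring blocks,
# and return the text of the best-scoring cluster.
def extract_main_text(html, decay: 0.75, min_score: 20)
  # If AdSense section markers are present, look only inside them.
  sections = html.scan(AD_SECTION).flatten
  html = sections.join("\n") unless sections.empty?

  factor = 1.0    # weight for the current block; shrinks further down the page
  clusters = []   # finished clusters: { text: [String], score: Float }
  current = nil   # cluster being built, if any

  html.split(BLOCK_SEPARATOR).each do |block|
    text = strip_tags(block)
    next if text.empty?
    score = block_score(block) * factor
    factor *= decay
    if score >= min_score
      current ||= { text: [], score: 0.0 }
      current[:text] << text
      current[:score] += score
    elsif current
      clusters << current   # a low-scoring block ends the current cluster
      current = nil
    end
  end
  clusters << current if current

  best = clusters.max_by { |c| c[:score] }
  best ? best[:text].join("\n") : ''
end
```

Calling `extract_main_text(html)` on a page whose navigation bar consists mostly of links and whose article sits in a few adjacent paragraphs should return only those paragraphs, because the link-heavy blocks score below the threshold and break the cluster.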

Demonstration of usage

$KCODE = "u" # necessary for Japanese text (Ruby 1.8)
require 'rubygems'
require 'extractcontent'

# Constructor
opt = {:decay_factor => 0.75} # optional settings
extractor = ExtractContent::Extractor.new(opt)

html = '<html> ~~~ </html>' # target HTML
body, title = extractor.analyse(html) # analyse; returns the body text and the title



How to submit patches

Read the 8 steps for fixing other people’s code. For section 8b, “Submit patch to Google Groups”, use the extractcontent Google Group (extractcontent at googlegroups.com).

For anonymous access, the trunk repository is at svn://rubyforge.org/var/svn/extractcontent/trunk.


This code is free to use under the terms of the BSD license.


Comments are welcome. Send an email to “extractcontent at googlegroups.com” via the forum.

Copyright (c) 2007/2008 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.
Theme extended from Paul Battley