extractcontent
Get Version
0.0.1
What
This module extracts the main text from a web page (HTML content). It automatically picks out the sub-blocks of the HTML that are most likely to be the body text, excluding menus, comments, navigation, affiliate links and so on.
Installing
sudo gem install extractcontent
The basics
- Separate the HTML into blocks, calculate a score for each block, and ignore low-scoring blocks.
- Calculate the score from block arrangement, text length, and whether the block contains affiliate links or characteristic keywords.
- Cluster consecutive high-scoring blocks and compare the clusters.
- If the page includes a “Google AdSense Section Target”, give it particular attention.
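The scoring and clustering steps above can be illustrated with a minimal sketch. This is not the gem's actual implementation: the method names (strip_tags, block_score, extract_main_text), the list of block-boundary tags, the threshold of 30 and the way the decay factor is applied per block are all assumptions made for illustration. Keyword scoring and the Google AdSense Section Target handling are left out.

```ruby
# Illustrative sketch only. Names, tag list and thresholds are assumptions,
# not the internals of the extractcontent gem.

# Remove any remaining tags and collapse whitespace.
def strip_tags(fragment)
  fragment.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
end

# Score a block by the length of its text outside links,
# weighted by a position factor (earlier blocks weigh more).
def block_score(block, factor)
  unlinked = block.gsub(%r{<a\b[^>]*>.*?</a>}im, '')
  strip_tags(unlinked).length * factor
end

# Split the HTML into blocks, score them, cluster consecutive
# high-scoring blocks, and return the text of the best cluster.
def extract_main_text(html, threshold: 30, decay_factor: 0.75)
  # Tags treated as block boundaries (an assumed, simplified list).
  boundary = %r{</?(?:div|td|tr|table|ul|ol|li|p|h[1-6])\b[^>]*>}i

  factor = 1.0
  clusters = []
  current = nil

  html.split(boundary).each do |block|
    text = strip_tags(block)
    next if text.empty?              # whitespace-only pieces are skipped

    score = block_score(block, factor)
    factor *= decay_factor           # later blocks count for less

    if score >= threshold
      if current.nil?                # start a new cluster
        current = { texts: [], score: 0.0 }
        clusters << current
      end
      current[:texts] << text
      current[:score] += score
    else
      current = nil                  # a low-scoring block ends the cluster
    end
  end

  best = clusters.max_by { |c| c[:score] }
  best ? best[:texts].join("\n") : ''
end
```

In this sketch a block made up only of links (such as a menu or an affiliate banner) scores zero, so it is dropped and also separates the clusters on either side of it.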
Demonstration of usage
$KCODE="u" # necessary when the HTML contains Japanese
require 'rubygems'
require 'extractcontent.rb'
# Constructor
opt = {:decay_factor=>0.75} # optional settings
extractor = ExtractContent::Extractor.new(opt)
html = '<html> ~~~ </html>' # target html
body, title = extractor.analyse(html) # analyse
Forum
http://groups.google.com/group/extractcontent
How to submit patches
Read the 8 steps for fixing other people’s code. For section 8b (Submit patch to Google Groups), use the Google Group above.
The trunk repository is available for anonymous access at svn://rubyforge.org/var/svn/extractcontent/trunk
License
This code is free to use under the terms of the BSD license.
Contact
Comments are welcome. Send an email to “extractcontent at googlegroups.com” via the forum.
Copyright (c) 2007/2008 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.
Theme extended from Paul Battley