Shuyo’s Weblog

About my favorite technical issues

Archive for November, 2008

Extract body of Project Gutenberg’s text

Posted by shuyo on November 24, 2008

Project Gutenberg’s texts are convenient for experiments in text analysis and information retrieval.
But their headers and footers come in an extremely free format…

I wrote a Ruby script that tries to extract the body of a Project Gutenberg text.
On the texts of Project Gutenberg’s 2003 CD (published by the CD and DVD Project), its precision is about 98%.

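# Strip the Project Gutenberg header and footer from st and return the body.
def extract_gp_body(st)
  # Drop trailing spaces and CRs on every line, then pad the end with a blank line.
  text = st.gsub(/[ \r]+$/, "") + "\n\n"
  # Remove embedded comment markers and HTML blocks.
  text.gsub!(/<-- .+? -->/m, "")
  text.gsub!(/<HTML>.+?<\/HTML>/mi, "")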

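  # Words that tend to appear in boilerplate rather than in the body.
  r = /http|internet|project gutenberg|mail|ocr/i
  # Split the text at each header/footer marker line and keep the side that looks like the body.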
  while text =~ /^(?:.+?END\*{1,2} ?|\*{3} START OF THE PROJECT GUTENBERG E(?:BOOK|TEXT).*? \*{3}|\*{9}END OF .+?|\*{3} END OF THE PROJECT GUTENBERG E(?:BOOK|TEXT).+?|\*{3}START\*.+\*START\*{3}|\**This file should be named .+|\*{5}These [eE](?:Books|texts) (?:Are|Were) Prepared By .+\*{5})$/
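    # Text before and after the matched marker line.
    pre, post = $`, $'
    # Keep the side that is more than three times longer; failing that,
    # keep the side that contains fewer boilerplate words.
    text = if pre.length > post.length*3 then
      pre
    elsif post.length > pre.length*3 then
      post
    elsif pre.scan(r).length < post.scan(r).length
      pre
    else
      post
    end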
  end

  text.gsub!(/^(?:Executive Director's Notes:|\[?Transcriber's Note|PREPARER'S NOTE|\[Redactor's note|\{This e-text has been prepared|As you may be aware, Project Gutenberg has been involved with|[\[\*]Portions of this header are|A note from the digitizer|ETEXT EDITOR'S BOOKMARKS|\[NOTE:|\[Project Gutenberg is|INFORMATION ABOUT THIS E-TEXT EDITION\n+|If you find any errors|This electronic edition was|Notes about this etext:|A request to all readers:|Comments on the preparation of the E-Text:|The base text for this edition has been provided by).+?\n(?:[\-\*]+)?\n\n/mi, "")
  text.gsub!(/^[\[\n](?:[^\[\]\n]+\n)*[^\n]*(?:Project\sGutenberg|\setext\s|\s[A-Za-z0-9]+@[a-z\-]+\.(?:com|net))[^\n]*(?:\n[^\[\]\n]+)*[\]\n]$/i, "")
  # Remove inline end-of-etext markers, then normalize the surrounding whitespace.
  text.gsub!(/\{The end of etext of .+?\}/, "")
  text = text.strip + "\n\n"

  text.gsub!(/^(?:(?:End )?(?:of ?)?(?:by |This |The )?Project Gutenberg(?:'s )?(?:Etext)?|This (?:Gutenberg )?Etext).+?\n\n/mi, "")
  text.gsub!(/^(?:\(?E?-?(?:text )?(?:prepared|Processed|scanned|Typed|Produced|Edited|Entered|Transcribed|Converted) by|Transcribed from|Scanning and first proofing by|Scanned and proofed by|This e-text|This EBook of|Scanned with|This Etext created by|This eBook was (?:produced|updated) by|Image files scanned in by|\[[^\n]*mostly scanned by).+?\n\n/mi, "")

  return text
end
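
The core heuristic in the while loop above, deciding which side of a marker line to keep, can be looked at in isolation. The following is only a minimal sketch: the helper name pick_body_side and the constant BOILERPLATE_WORDS are hypothetical and do not appear in the script, and the sample strings are made up for illustration.

```ruby
# Hypothetical helper: isolates the pre/post decision made inside the
# while loop of extract_gp_body. Not part of the original script.
BOILERPLATE_WORDS = /http|internet|project gutenberg|mail|ocr/i

def pick_body_side(pre, post)
  # A side more than three times longer than the other is assumed to be the body.
  return pre  if pre.length > post.length * 3
  return post if post.length > pre.length * 3
  # Otherwise keep the side that mentions fewer boilerplate words.
  if pre.scan(BOILERPLATE_WORDS).length < post.scan(BOILERPLATE_WORDS).length
    pre
  else
    post
  end
end

# Made-up sample: a short header full of boilerplate words, then body text.
header = "Visit http://example.org or mail us about Project Gutenberg.\n"
body   = "It was the best of times, it was the worst of times.\n" * 2
chosen = pick_body_side(header, body)
```

Here neither side is three times longer than the other, so the keyword count decides and the body side is kept. When the counts tie, the text after the marker wins, as in the original loop.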

Original Japanese article: Extract body of Project Gutenberg’s text

Posted in text analysis

JSRuby – Ruby interpreter implemented in JavaScript

Posted by shuyo on November 17, 2008

JSRuby is a Ruby interpreter implemented in JavaScript.

JSRuby Project Page (CodeRepos)
http://coderepos.org/share/wiki/JSRuby

It is based on Ruby 1.8 and is only partially implemented so far.

Its main features are the following:

  • it also implements a Ruby parser, so it can execute Ruby scripts in the browser by itself
  • JSRuby scripts can handle any JavaScript object

Because JSRuby includes both the parser and the interpreter, it can also run as a bookmarklet.

As an experimental feature, it also supports a sleep function.

Related blog entries:
- JSRuby 1.0 Released – Ruby interpreter implemented in JavaScript (in Japanese)
- Using jQuery on JSRuby – How to connect Ruby to JavaScript (in Japanese)
- Asynchronous JSRuby – Experimental implementation of sleep (in Japanese)

Posted in Uncategorized

Outputz – record your output volume

Posted by shuyo on November 10, 2008

You write blog articles, tweets, BBS posts and so on. Have you ever wanted to know how much you write? Today? Yesterday? On Twitter?

Outputz is a Firefox add-on that records your output volume.
Once you install Outputz, the add-on counts all the text you post and sends the number of bytes written.
You can view your output volume hourly, daily, weekly, monthly, yearly and per domain.
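
To make "number of written bytes" and "per domain" concrete, here is a minimal sketch of such a tally. It is not Outputz’s actual code: the method name tally_output, the post hashes and the URLs are all made up for illustration, and text is assumed to be UTF-8.

```ruby
require "uri"

# Hypothetical tally of posted bytes per domain; not Outputz's actual code.
def tally_output(posts)
  totals = Hash.new(0)
  posts.each do |post|
    domain = URI.parse(post[:url]).host
    # bytesize counts bytes, not characters, so multibyte text weighs more.
    totals[domain] += post[:text].bytesize
  end
  totals
end

# Made-up sample posts.
posts = [
  { url: "http://twitter.com/home",  text: "hello" },
  { url: "http://twitter.com/home",  text: "café" },
  { url: "http://example.com/bbs",   text: "abc" }
]
totals = tally_output(posts)
```

Note that "café" is four characters but five bytes in UTF-8, so a byte count and a character count differ as soon as the text is not plain ASCII.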

[Outputz graph: 2008 Outputz, 236,425 bytes (powered by Outputz)]

This is my output for roughly the month of October.
You can embed your output graph in your blog like this!

Posted in Uncategorized

iVoca – Word Typing Game to Remember

Posted by shuyo on November 7, 2008

iVoca (i-Vocabulary) is a word typing game that helps you remember words quickly.

This is a screenshot of iVoca.

When you press the “Typing Start” button on the title screen, word cards start falling.
Each word card shows a question, and you type its answer as quickly as you can.
If you don’t know the answer, it gradually appears on the card.

You only need to type the letters of the words you want to remember.

iVoca also lets you create your own BOOK, a list of question words.

This service supports OpenID authentication, but at present the only supported OpenID providers are Japanese ones.
Unregistered users can still play iVoca, though they cannot create a BOOK.

Examples of BOOKs are as follows.

I’ll soon create an English-to-Italian or Italian-to-English BOOK!

Posted in Uncategorized