Shuyo’s Weblog

About my favorite technical issues

Extract body of Project Gutenberg’s text

Posted by shuyo on November 24, 2008

Project Gutenberg’s texts are convenient for experiments in text analysis and information retrieval.
But their headers and footers come in an extremely free format…

I wrote a Ruby script to extract the body of Project Gutenberg’s texts.
On Project Gutenberg’s 2003 CD (published by the CD and DVD Project), its precision is about 98%.

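# Strip Project Gutenberg header/footer boilerplate from st and return the body.
def extract_gp_body(st)
  # Remove trailing spaces and CRs from every line and end the text with a blank line.
  text = st.gsub(/[ \r]+$/, "") + "\n\n"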
  text.gsub!(/<-- .+? -->/m, "")
  text.gsub!(/<HTML>.+?<\/HTML>/mi, "")

  # Words that tend to appear in boilerplate rather than in the body.
  r = /http|internet|project gutenberg|mail|ocr/i
  while text =~ /^(?:.+?END\*{1,2} ?|\*{3} START OF THE PROJECT GUTENBERG E(?:BOOK|TEXT).*? \*{3}|\*{9}END OF .+?|\*{3} END OF THE PROJECT GUTENBERG E(?:BOOK|TEXT).+?|\*{3}START\*.+\*START\*{3}|\**This file should be named .+|\*{5}These [eE](?:Books|texts) (?:Are|Were) Prepared By .+\*{5})$/
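    # Split at the marker line. Keep the side that is over 3x longer;
    # failing that, keep the side with fewer boilerplate words.
    pre, post = $`, $'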
    text = if pre.length > post.length*3 then
      pre
    elsif post.length > pre.length*3 then
      post
    elsif pre.scan(r).length < post.scan(r).length
      pre
    else
      post
    end
  end

  text.gsub!(/^(?:Executive Director's Notes:|\[?Transcriber's Note|PREPARER'S NOTE|\[Redactor's note|\{This e-text has been prepared|As you may be aware, Project Gutenberg has been involved with|[\[\*]Portions of this header are|A note from the digitizer|ETEXT EDITOR'S BOOKMARKS|\[NOTE:|\[Project Gutenberg is|INFORMATION ABOUT THIS E-TEXT EDITION\n+|If you find any errors|This electronic edition was|Notes about this etext:|A request to all readers:|Comments on the preparation of the E-Text:|The base text for this edition has been provided by).+?\n(?:[\-\*]+)?\n\n/mi, "")
  text.gsub!(/^[\[\n](?:[^\[\]\n]+\n)*[^\n]*(?:Project\sGutenberg|\setext\s|\s[A-Za-z0-9]+@[a-z\-]+\.(?:com|net))[^\n]*(?:\n[^\[\]\n]+)*[\]\n]$/i, "")
  text.gsub!(/\{The end of etext of .+?\}/, "")
  text = text.strip + "\n\n"

  text.gsub!(/^(?:(?:End )?(?:of ?)?(?:by |This |The )?Project Gutenberg(?:'s )?(?:Etext)?|This (?:Gutenberg )?Etext).+?\n\n/mi, "")
  text.gsub!(/^(?:\(?E?-?(?:text )?(?:prepared|Processed|scanned|Typed|Produced|Edited|Entered|Transcribed|Converted) by|Transcribed from|Scanning and first proofing by|Scanned and proofed by|This e-text|This EBook of|Scanned with|This Etext created by|This eBook was (?:produced|updated) by|Image files scanned in by|\[[^\n]*mostly scanned by).+?\n\n/mi, "")

  return text
end
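
The marker loop in the script above can be tried in isolation. The following is a minimal sketch of the same idea on a toy input; MARKER is a simplified stand-in pattern, and keep_body_side and the sample text are made up for illustration, not part of the original script.

```ruby
# Toy version of the marker-splitting heuristic: whenever a marker line is
# found, keep the side that is more than 3x longer; otherwise keep the side
# that contains fewer boilerplate keywords.
# MARKER is a simplified assumption, far less thorough than the real pattern.
MARKER   = /^\*{3} (?:START|END) OF THE PROJECT GUTENBERG EBOOK.*$/
KEYWORDS = /http|internet|project gutenberg|mail|ocr/i

def keep_body_side(text)
  while (m = MARKER.match(text))
    pre, post = m.pre_match, m.post_match
    text =
      if pre.length > post.length * 3 then pre
      elsif post.length > pre.length * 3 then post
      elsif pre.scan(KEYWORDS).length < post.scan(KEYWORDS).length then pre
      else post
      end
  end
  text
end

# Invented sample: short header, long body, short footer.
sample = <<~TEXT
  The Project Gutenberg EBook of Example
  See http://www.gutenberg.org for details.
  *** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
  #{"It was a dark and stormy night. " * 20}
  *** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
  Project Gutenberg license text, mail us at http://example.org
TEXT

body = keep_body_side(sample).strip
puts body[0, 40]
```

Because every iteration keeps only the text before or after the marker, the text shrinks each time and the loop always terminates.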

Original Japanese article: Extract body of Project Gutenberg’s text

Posted in text analysis

JSRuby – Ruby interpreter implemented in JavaScript

Posted by shuyo on November 17, 2008

JSRuby is a Ruby interpreter implemented in JavaScript.

JSRuby Project Page (CodeRepos)
http://coderepos.org/share/wiki/JSRuby

It is based on Ruby 1.8 and is only partly implemented so far.

Its key features are the following:

  • it also implements a Ruby parser, so it can execute Ruby scripts in the browser by itself
  • it can handle any JavaScript objects in JSRuby scripts

JSRuby includes both a parser and an interpreter, so it can also run as a bookmarklet.

As an experimental feature, it also supports a sleep function.

Related blog entries:
- JSRuby 1.0 Released – Ruby interpreter implemented in JavaScript (in Japanese)
- Using jQuery on JSRuby – How to connect Ruby to JavaScript (in Japanese)
- Asynchronous JSRuby – Experimental implementation of sleep (in Japanese)


Outputz – record your output volume

Posted by shuyo on November 10, 2008

You write blog articles, tweets, BBS posts and so on. Have you ever wanted to know how much you write? Today? Yesterday? On Twitter?

Outputz is a Firefox add-on that records your output volume.
Once you install Outputz, the add-on counts all the text you post and sends the number of bytes written.
You can view your output volume hourly, daily, weekly, monthly, yearly and per domain.

[Outputz graph: 2008, 236,425 bytes. Powered by Outputz.]

This is my output for roughly October.
You can embed your output graph in a blog like this!


iVoca – Word Typing Game to Remember

Posted by shuyo on November 7, 2008

iVoca (i-Vocabulary) is a word typing game that helps you memorize words quickly.

This is a screenshot of iVoca.

When you push the “Typing Start” button on the title screen, word cards start falling.
Each word card carries a question, so you type its answer quickly.
If you don’t know an answer, it gradually appears on the card.

You only need to type the letters of the words you want to remember.

iVoca also lets you create your own BOOK, a list of question words.

This service supports OpenID authentication, but at present the only supported OpenID providers are Japanese ones.
Unregistered users can still play iVoca, but cannot create a BOOK.

Examples of BOOKs are as follows.

I’ll soon create an English-to-Italian or Italian-to-English BOOK!


Implementations of URI Templates in various languages

Posted by shuyo on July 24, 2008

Before implementing URI Templates in C++/Xbyak, I surveyed existing implementations of URI Templates.

Python (Joe Gregorio’s own experimental implementation)
http://code.google.com/p/uri-templates/
Ruby – Addressable
http://addressable.rubyforge.org/
Ruby – uri-templates
http://github.com/juretta/uri-templates/tree/master
Javascript – url_template
http://www.mnot.net/javascript/url_template/
Javascript – Template
http://www.snellspace.com/wp/?p=831
Perl – URI::Template
http://search.cpan.org/~bricas/URI-Template/
.NET – UriTemplate
http://msdn.microsoft.com/en-us/library/system.uritemplate.aspx
PHP – URI_Template
http://pear.php.net/package/URI_Template/download/
Java – Apache Abdera
http://cwiki.apache.org/ABDERA/uri-templates.html
Java – Metanotion URLMapper
http://www.metanotion.net/software/urlmapper/
Erlang – uri-template
http://tfletcher.com/dev/erlang-uri-template
C++ – URITemplates
http://shuyo.wordpress.com/2008/07/17/jit-compiler-for-uri-templates-cxbyak/

Some of them implement an “extract” method, which the URI Templates specification doesn’t mention. Here “extract” means extracting parameter values from a URI.
To me, URI Templates are valuable only if an implementation supports both extend and extract, because we often need to generate and parse URIs of the same format when our application needs connectivity (connectedness). Of course, we can generate URIs by string concatenation and parse them with regular expressions, but that is error-prone when the format changes.
So I hope the URI Templates specification will include “extract”. It would impose restrictions on templates, however… (e.g. extraction with the template “http://example.com/{foo}{bar}” isn’t well defined.)
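
To make the “extract” problem concrete, here is a minimal sketch in Ruby. It assumes the simple draft-01 style {name} syntax only; template_to_regex and extract are hypothetical helper names, not taken from any of the libraries listed above, and no percent-decoding is attempted.

```ruby
# Turn a template like "http://example.com/{foo}/{bar}" into an anchored
# regular expression with one named capture per {variable}.
# Hypothetical helper; assumes every variable name appears only once.
def template_to_regex(template)
  # split with a capture group returns literals at even indices and
  # variable names at odd indices.
  source = template.split(/\{(\w+)\}/).each_with_index.map do |part, i|
    i.even? ? Regexp.escape(part) : "(?<#{part}>.*?)"
  end.join
  Regexp.new("\\A#{source}\\z")
end

# Return a Hash of variable name => value, or nil when the URI does not match.
def extract(template, uri)
  m = template_to_regex(template).match(uri)
  m && m.named_captures
end

p extract("http://example.com/{foo}/{bar}", "http://example.com/a/b")
# The next template matches too, but how "abc" is divided between foo and bar
# depends only on the regex engine's matching strategy, not on the template.
p extract("http://example.com/{foo}{bar}", "http://example.com/abc")
```

With the non-greedy captures used here, the second call puts everything into bar; a greedy pattern would put everything into foo. Neither answer is more correct than the other, which is why a specification for extract would have to restrict such templates.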

Original Japanese article: Implementations of URI Templates in various languages

Posted in URI Templates

JIT Compiler for URI Templates (C++/Xbyak)

Posted by shuyo on July 17, 2008

URI Templates ( http://bitworking.org/projects/URI-Templates/ ) provide a way to generate URIs that share a common structure.

For example, the template “/weather/{state}/{city}?forecast={day}” has 3 parameters, “state”, “city” and “day”, and generates a URI when a value is given for each parameter.
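
As a rough illustration of this example, the following Ruby sketch lists the parameters of the template and fills them in. The helper names template_variables and extend_template are made up for this sketch, and it handles only the plain {name} form without any percent-encoding.

```ruby
TEMPLATE = "/weather/{state}/{city}?forecast={day}"

# Names of the parameters, in order of appearance.
def template_variables(template)
  template.scan(/\{(\w+)\}/).flatten
end

# Replace every {name} with the matching value from the Hash.
# Raises KeyError if a value is missing. No percent-encoding is applied,
# which a real implementation would have to handle.
def extend_template(template, values)
  template.gsub(/\{(\w+)\}/) { values.fetch(Regexp.last_match(1)) }
end

p template_variables(TEMPLATE)
puts extend_template(TEMPLATE, "state" => "Washington", "city" => "Redmond", "day" => "today")
```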

Wouldn’t it be interesting to implement URI Templates as a JIT compiler…?

So I tried writing a JIT compiler for URI Templates with Xbyak.

URI Templates for C++
http://coderepos.org/share/browser/lang/cplusplus/URITemplates

Xbyak is a header-only library that can dynamically generate x86 machine code at run time.

When the previous template example “/weather/{state}/{city}?forecast={day}” is given to the JIT compiler class URITemplatesJIT, it dynamically generates the following x86 code.

003959D0 53               push        ebx
003959D1 56               push        esi
003959D2 8B 4C 24 0C      mov         ecx,dword ptr [esp+0Ch]
003959D6 8B 74 24 10      mov         esi,dword ptr [esp+10h]
003959DA B8 13 00 00 00   mov         eax,13h
003959DF C6 01 77         mov         byte ptr [ecx],77h ; 'w'
003959E2 C6 41 01 65      mov         byte ptr [ecx+1],65h ; 'e'
003959E6 C6 41 02 61      mov         byte ptr [ecx+2],61h ; 'a'
003959EA C6 41 03 74      mov         byte ptr [ecx+3],74h ; 't'
003959EE C6 41 04 68      mov         byte ptr [ecx+4],68h ; 'h'
003959F2 C6 41 05 65      mov         byte ptr [ecx+5],65h ; 'e'
003959F6 C6 41 06 72      mov         byte ptr [ecx+6],72h ; 'r'
003959FA C6 41 07 2F      mov         byte ptr [ecx+7],2Fh ; '/'
003959FE 8D 49 08         lea         ecx,[ecx+8]
00395A01 8B 16            mov         edx,dword ptr [esi]
00395A03 EB 05            jmp         00395A0A
00395A05 88 19            mov         byte ptr [ecx],bl
00395A07 40               inc         eax
00395A08 41               inc         ecx
00395A09 42               inc         edx
00395A0A 8A 1A            mov         bl,byte ptr [edx]
00395A0C 84 DB            test        bl,bl
00395A0E 75 F5            jne         00395A05
00395A10 C6 01 2F         mov         byte ptr [ecx],2Fh ; '/'
00395A13 41               inc         ecx
00395A14 8B 56 04         mov         edx,dword ptr [esi+4]
00395A17 EB 05            jmp         00395A1E
00395A19 88 19            mov         byte ptr [ecx],bl
00395A1B 40               inc         eax
00395A1C 41               inc         ecx
00395A1D 42               inc         edx
00395A1E 8A 1A            mov         bl,byte ptr [edx]
00395A20 84 DB            test        bl,bl
00395A22 75 F5            jne         00395A19
00395A24 C6 01 3F         mov         byte ptr [ecx],3Fh ; '?'
00395A27 C6 41 01 66      mov         byte ptr [ecx+1],66h ; 'f'
00395A2B C6 41 02 6F      mov         byte ptr [ecx+2],6Fh ; 'o'
00395A2F C6 41 03 72      mov         byte ptr [ecx+3],72h ; 'r'
00395A33 C6 41 04 65      mov         byte ptr [ecx+4],65h ; 'e'
00395A37 C6 41 05 63      mov         byte ptr [ecx+5],63h ; 'c'
00395A3B C6 41 06 61      mov         byte ptr [ecx+6],61h ; 'a'
00395A3F C6 41 07 73      mov         byte ptr [ecx+7],73h ; 's'
00395A43 C6 41 08 74      mov         byte ptr [ecx+8],74h ; 't'
00395A47 C6 41 09 3D      mov         byte ptr [ecx+9],3Dh ; '='
00395A4B 8D 49 0A         lea         ecx,[ecx+0Ah]
00395A4E 8B 56 08         mov         edx,dword ptr [esi+8]
00395A51 EB 05            jmp         00395A58
00395A53 88 19            mov         byte ptr [ecx],bl
00395A55 40               inc         eax
00395A56 41               inc         ecx
00395A57 42               inc         edx
00395A58 8A 1A            mov         bl,byte ptr [edx]
00395A5A 84 DB            test        bl,bl
00395A5C 75 F5            jne         00395A53
00395A5E C6 01 00         mov         byte ptr [ecx],0
00395A61 5E               pop         esi
00395A62 5B               pop         ebx
00395A63 C3               ret
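
Reading the listing, the generated code seems to write the literal characters one by one, copy each parameter string from an array of pointers until its terminating zero, and count the written bytes in eax, which starts at 13h (19, the number of literal characters). The following Ruby sketch only mimics that shape with a closure; it is an analogy under those assumptions, not how URITemplatesJIT is written, and compile_template and literal_length are hypothetical names.

```ruby
# "Compile" a template once: split it into its literal pieces, then return a
# lambda that interleaves those pieces with positional parameters. The template
# is never parsed again when the lambda is called.
def compile_template(template)
  # The -1 limit keeps a trailing empty literal, so there is always exactly
  # one more literal than there are variables.
  literals = template.split(/\{\w+\}/, -1)
  lambda do |params|
    out = +""
    literals.each_with_index do |literal, i|
      out << literal                                # like the fixed "mov byte ptr" runs
      out << params[i] if i < literals.length - 1   # like the copy loops
    end
    out
  end
end

# Number of literal characters in a template.
def literal_length(template)
  template.gsub(/\{\w+\}/, "").length
end

generate = compile_template("weather/{state}/{city}?forecast={day}")
puts generate.call(%w[Washington Redmond Today])
puts literal_length("weather/{state}/{city}?forecast={day}")
```

The template is written without a leading slash here because the listing starts writing at 'w'.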

How much does this JIT compiler improve performance?
I measured the performance of 3 implementations, including the JIT compiler.

  • URITemplatesNormal : an ordinary implementation with const_iterator
  • URITemplatesRegex : an experimental implementation with boost::regex
  • URITemplatesJIT : JIT compiler with Xbyak

The following are the results for the 3 implementations.

==== URITemplatesRegex
city:Redmond
day:today
state:Washington
>> test1(extract) : 6.062usec
weather/Washington/Redmond?forecast=Today
>> test2(extend) : 12.094usec

==== URITemplatesNormal
city:Redmond
day:today
state:Washington
>> test1(extract) : 4.344usec
weather/Washington/Redmond?forecast=Today
>> test2(extend) : 1.842usec

==== URITemplatesJIT
city:Redmond
day:today
state:Washington
>> test1(extract) : 3.188usec
weather/Washington/Redmond?forecast=Today
>> test2(extend) : 0.72usec

JIT’s extend method (create a new URI from parameters) is about 2.5 times faster than Normal’s.
But JIT’s extract method (extract parameters from a URI) is only about 1.3 times faster.
This is because assignment into std::map has considerable overhead. Indeed, it takes 70% of the execution time of JIT’s extract!
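
The ratios can be recomputed from the timings printed above. This small Ruby sketch just does the arithmetic; the last figure applies the 70% share mentioned above to the measured extract time, which is only a rough estimate of what remains once the std::map cost is set aside.

```ruby
# Timings in microseconds, copied from the results above.
TIMINGS = {
  "Regex"  => { extract: 6.062, extend: 12.094 },
  "Normal" => { extract: 4.344, extend: 1.842 },
  "JIT"    => { extract: 3.188, extend: 0.72 },
}

# How many times faster the candidate is than the baseline for a given test.
def speedup(baseline, candidate, test)
  (TIMINGS[baseline][test] / TIMINGS[candidate][test]).round(2)
end

extend_gain  = speedup("Normal", "JIT", :extend)
extract_gain = speedup("Normal", "JIT", :extract)

# Rough estimate: the part of JIT's extract time that is not std::map
# assignment, assuming the 70% share stated in the text.
non_map_time = (TIMINGS["JIT"][:extract] * 0.3).round(3)

puts "extend:  #{extend_gain}x"
puts "extract: #{extract_gain}x"
puts "extract without map assignment: about #{non_map_time} usec"
```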

This is a very simple implementation that only complies with URI Templates draft-01.
With more complicated logic such as draft-03, I expect the difference would grow…

Now, to tell the truth, this is my first C++ practice project! :P

Posted in C++, URI Templates