Shuyo’s Weblog

About my favorite technical issues

Extract body of Project Gutenberg’s text

Posted by shuyo on November 24, 2008

Project Gutenberg’s texts are convenient for experiments in text analysis and information retrieval.
But each text has a header and a footer, and their format varies widely from file to file.

I tried writing a Ruby script to extract the body of Project Gutenberg’s texts.
On Project Gutenberg’s 2003 CD (available from the CD and DVD Project), its precision is about 98%.

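# Returns the body of the Project Gutenberg text `st` with the header, footer,
# and most editorial notes removed.
def extract_gp_body(st)
  # Strip trailing spaces and CRs from each line, then drop markup blocks.
  text = st.gsub(/[ \r]+$/, "") + "\n\n"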
  text.gsub!(/<-- .+? -->/m, "")
  text.gsub!(/<HTML>.+?<\/HTML>/mi, "")

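  # Keywords that appear mostly in boilerplate; used to break ties below.
  r = /http|internet|project gutenberg|mail|ocr/i
  # While a header/footer marker line remains, split there and keep one side.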
  while text =~ /^(?:.+?END\*{1,2} ?|\*{3} START OF THE PROJECT GUTENBERG E(?:BOOK|TEXT).*? \*{3}|\*{9}END OF .+?|\*{3} END OF THE PROJECT GUTENBERG E(?:BOOK|TEXT).+?|\*{3}START\*.+\*START\*{3}|\**This file should be named .+|\*{5}These [eE](?:Books|texts) (?:Are|Were) Prepared By .+\*{5})$/
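    pre, post = $`, $'  # text before and after the marker line
    # Keep the side that is over 3x longer; otherwise keep the side with
    # fewer boilerplate keywords (post on a tie).
    text = if pre.length > post.length*3 then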
      pre
    elsif post.length > pre.length*3 then
      post
    elsif pre.scan(r).length < post.scan(r).length
      pre
    else
      post
    end
  end

  # Remove editorial note paragraphs, bracketed notices, and end-of-etext tags.
  text.gsub!(/^(?:Executive Director's Notes:|\[?Transcriber's Note|PREPARER'S NOTE|\[Redactor's note|\{This e-text has been prepared|As you may be aware, Project Gutenberg has been involved with|[\[\*]Portions of this header are|A note from the digitizer|ETEXT EDITOR'S BOOKMARKS|\[NOTE:|\[Project Gutenberg is|INFORMATION ABOUT THIS E-TEXT EDITION\n+|If you find any errors|This electronic edition was|Notes about this etext:|A request to all readers:|Comments on the preparation of the E-Text:|The base text for this edition has been provided by).+?\n(?:[\-\*]+)?\n\n/mi, "")
  text.gsub!(/^[\[\n](?:[^\[\]\n]+\n)*[^\n]*(?:Project\sGutenberg|\setext\s|\s[A-Za-z0-9]+@[a-z\-]+\.(?:com|net))[^\n]*(?:\n[^\[\]\n]+)*[\]\n]$/i, "")
  text.gsub!(/\{The end of etext of .+?\}/, "")
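  text = text.strip + "\n\n"

  # Remove paragraphs that open with a Project Gutenberg title line or a
  # production credit (e.g. "Produced by ...").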
  text.gsub!(/^(?:(?:End )?(?:of ?)?(?:by |This |The )?Project Gutenberg(?:'s )?(?:Etext)?|This (?:Gutenberg )?Etext).+?\n\n/mi, "")
  text.gsub!(/^(?:\(?E?-?(?:text )?(?:prepared|Processed|scanned|Typed|Produced|Edited|Entered|Transcribed|Converted) by|Transcribed from|Scanning and first proofing by|Scanned and proofed by|This e-text|This EBook of|Scanned with|This Etext created by|This eBook was (?:produced|updated) by|Image files scanned in by|\[[^\n]*mostly scanned by).+?\n\n/mi, "")

  return text
end
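
The while loop is the heart of the script, so here is a minimal sketch of that step on its own. It is only an illustration under a few assumptions: the method name `keep_body_side`, the reduced marker pattern, and the sample text are all made up for this example, and the real script uses the much longer pattern shown above.

```ruby
# Simplified sketch of the split-and-keep loop; the marker pattern below is a
# reduced, illustrative version and not the full pattern from the script.
def keep_body_side(text)
  marker   = /^\*{3} ?(?:START|END) OF THE PROJECT GUTENBERG E(?:BOOK|TEXT).*$/
  keywords = /http|internet|project gutenberg|mail|ocr/i

  while (m = marker.match(text))
    pre, post = m.pre_match, m.post_match
    text =
      if pre.length > post.length * 3
        pre                       # text before the marker dominates
      elsif post.length > pre.length * 3
        post                      # text after the marker dominates
      elsif pre.scan(keywords).length < post.scan(keywords).length
        pre                       # similar sizes: fewer boilerplate words wins
      else
        post
      end
  end
  text.strip
end

# Made-up input: a short header, a long body, and a short footer.
header = "The Project Gutenberg EBook of Example\nSee http://www.gutenberg.org\n"
body   = "It was a dark and stormy night.\n" * 20
footer = "Donations are accepted by mail.\n"
sample = header +
         "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n" +
         body +
         "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n" +
         footer

puts keep_body_side(sample) == body.strip
```

The length comparison does most of the work when the body is much longer than the header or footer. The keyword count only matters when the two sides are of similar size, which is where the heuristic is most likely to guess wrong on short texts.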

Original Japanese article: Extract body of Project Gutenberg’s text
