Extract body of Project Gutenberg’s text
Posted by shuyo on November 24, 2008
Project Gutenberg’s texts are convenient for experimenting with text analysis and information retrieval.
But their headers and footers come in an extremely free format…
I tried writing a Ruby script that extracts the body of a Project Gutenberg text.
On the texts from Project Gutenberg’s 2003 CD (published at the CD and DVD Project), its precision is about 98%.
def extract_gp_body(st)
  # Strip trailing spaces and carriage returns from every line, and make sure
  # the text ends with a blank line so the paragraph patterns below can match.
  text = st.gsub(/[ \r]+$/, "") + "\n\n"
  # Drop "<-- ... -->" blocks and embedded HTML documents.
  text.gsub!(/<-- .+? -->/m, "")
  text.gsub!(/<HTML>.+?<\/HTML>/mi, "")
  # Words that tend to appear in boilerplate rather than in the book itself.
  r = /http|internet|project gutenberg|mail|ocr/i
  # As long as a header/footer marker line remains, split the text there and
  # keep the side that looks more like the body.
  while text =~ /^(?:.+?END\*{1,2} ?|\*{3} START OF THE PROJECT GUTENBERG E(?:BOOK|TEXT).*? \*{3}|\*{9}END OF .+?|\*{3} END OF THE PROJECT GUTENBERG E(?:BOOK|TEXT).+?|\*{3}START\*.+\*START\*{3}|\**This file should be named .+|\*{5}These [eE](?:Books|texts) (?:Are|Were) Prepared By .+\*{5})$/
    pre, post = $`, $'
    text = if pre.length > post.length*3
      pre   # the part before the marker is far longer
    elsif post.length > pre.length*3
      post  # the part after the marker is far longer
    elsif pre.scan(r).length < post.scan(r).length
      pre   # similar sizes: keep the side with fewer boilerplate words
    else
      post
    end
  end
  # Remove preparer's and transcriber's note blocks, from their opening phrase
  # through the next blank line.
  text.gsub!(/^(?:Executive Director's Notes:|\[?Transcriber's Note|PREPARER'S NOTE|\[Redactor's note|\{This e-text has been prepared|As you may be aware, Project Gutenberg has been involved with|[\[\*]Portions of this header are|A note from the digitizer|ETEXT EDITOR'S BOOKMARKS|\[NOTE:|\[Project Gutenberg is|INFORMATION ABOUT THIS E-TEXT EDITION\n+|If you find any errors|This electronic edition was|Notes about this etext:|A request to all readers:|Comments on the preparation of the E-Text:|The base text for this edition has been provided by).+?\n(?:[\-\*]+)?\n\n/mi, "")
  # Remove paragraphs (delimited by brackets or blank lines) that mention
  # Project Gutenberg, "etext", or a .com/.net e-mail address.
  text.gsub!(/^[\[\n](?:[^\[\]\n]+\n)*[^\n]*(?:Project\sGutenberg|\setext\s|\s[A-Za-z0-9]+@[a-z\-]+\.(?:com|net))[^\n]*(?:\n[^\[\]\n]+)*[\]\n]$/i, "")
  text.gsub!(/\{The end of etext of .+?\}/, "")
  text = text.strip + "\n\n"
  # Remove paragraphs that open with a "Project Gutenberg" or "This Etext" phrase.
  text.gsub!(/^(?:(?:End )?(?:of ?)?(?:by |This |The )?Project Gutenberg(?:'s )?(?:Etext)?|This (?:Gutenberg )?Etext).+?\n\n/mi, "")
  # Remove credit paragraphs such as "Produced by ..." or "Scanned with ...".
  text.gsub!(/^(?:\(?E?-?(?:text )?(?:prepared|Processed|scanned|Typed|Produced|Edited|Entered|Transcribed|Converted) by|Transcribed from|Scanning and first proofing by|Scanned and proofed by|This e-text|This EBook of|Scanned with|This Etext created by|This eBook was (?:produced|updated) by|Image files scanned in by|\[[^\n]*mostly scanned by).+?\n\n/mi, "")
  return text
end
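The core of the script is the while loop: each time a marker line is found, the text is split there and only one side is kept. The following is a minimal sketch of that idea in isolation, not the full script. The names `MARKER`, `likelier_body` and `trim_at_markers`, the simplified marker pattern and the sample text are all made up for illustration; only the keyword pattern and the "three times longer" rule are taken from the script above.

```ruby
# Words that tend to appear in boilerplate rather than in the book itself
# (the same pattern as `r` in the script above).
BOILERPLATE = /http|internet|project gutenberg|mail|ocr/i

# Hypothetical, simplified marker: any line of the form "*** ... ***".
MARKER = /^\*{3} .+ \*{3}$/

# Decide which side of a marker line is more likely to be the book body.
def likelier_body(pre, post)
  if pre.length > post.length * 3
    pre    # the part before the marker is far longer
  elsif post.length > pre.length * 3
    post   # the part after the marker is far longer
  elsif pre.scan(BOILERPLATE).length < post.scan(BOILERPLATE).length
    pre    # similar sizes: keep the side with fewer boilerplate words
  else
    post
  end
end

# Cut the text at each marker line in turn, keeping the likelier side.
def trim_at_markers(text)
  while (m = MARKER.match(text))
    text = likelier_body(m.pre_match, m.post_match)
  end
  text
end

# Made-up sample with a short header, a body and a short footer.
sample = <<~TEXT
  Project Gutenberg etext, see http://example.invalid
  *** START OF THE PROJECT GUTENBERG EBOOK ***
  Call me Ishmael. Some years ago, never mind how long precisely,
  having little or no money in my purse, I went to sea.
  *** END OF THE PROJECT GUTENBERG EBOOK ***
  Send mail to report OCR errors.
TEXT

puts trim_at_markers(sample)
```

Because every pass discards the marker line and one side of it, the text gets strictly shorter and the loop always ends. With the full function loaded, a call would look something like `body = extract_gp_body(File.read("book.txt"))`, where `book.txt` is a hypothetical file name.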
Original Japanese article: Extract body of Project Gutenberg’s text