Software for everyone

We talk a lot about hardware here at BookLiberator; it is what we spend most of our time on, after all. But it is time to shine a light on the software behind the scenes that turns our page images into beautifully produced “book” collections. That software comes in two parts: Scantailor, written by Joseph Artsimovich, and djvubind, written by strider1551 of DIYBookScanner.

Scantailor takes the page images from your camera’s memory card:

[Image: page from Concerning Beards]

and turns them into nicely cropped, rotated, and white-balanced images like this:

[Image: processed page from Concerning Beards]

Djvubind takes all of those individual images, stitches them together, and compresses the result into a very tiny book in the DjVu format. I have 1,400-page academic books that are now pleasantly readable 10 MB files thanks to this combination of Scantailor and djvubind.

Nearly all of this happens automatically. For each of those 1,400-page books, all I had to do was 1) rotate the first two pages, 2) hit “Go” for auto crop, 3) draw a box around the few pictures so that their full resolution would be preserved in the final output, and 4) run djvubind.

Very simple, very easy. When djvubind, which is less than two weeks old, gets the last kinks out, it will be possible to use the same 4 steps to get a tiny book full of beautiful page images which also has a layer of OCR embedded for text searching.
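For anyone who wants to script the last of those steps, here is a minimal sketch rather than an official recipe. It assumes Scantailor has written its processed pages as TIFF files into one output folder, that `djvubind` is on your PATH, and that it accepts the image directory as its argument and binds pages in file-name order. The helper names (`find_page_images`, `build_bind_command`, `bind_book`) are made up for illustration, so check djvubind's own help output before relying on this.

```python
import subprocess
from pathlib import Path

# Assumption: Scantailor's processed pages are TIFF files in one folder.
IMAGE_SUFFIXES = {".tif", ".tiff"}


def find_page_images(out_dir: Path) -> list[Path]:
    """Return the processed page images in file-name order."""
    return sorted(p for p in out_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)


def build_bind_command(out_dir: Path) -> list[str]:
    """Build the djvubind command line for a folder of processed pages."""
    if not find_page_images(out_dir):
        raise FileNotFoundError(f"no processed page images in {out_dir}")
    # Assumption: djvubind takes the image directory as its only required argument.
    return ["djvubind", str(out_dir)]


def bind_book(out_dir: Path) -> None:
    """Run djvubind on the folder; raises CalledProcessError if it fails."""
    subprocess.run(build_bind_command(out_dir), check=True)
```

Calling `bind_book(Path("out"))` would then stand in for step 4, with the first three steps still done by hand in Scantailor.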

For anyone who has been waiting to get into personal book scanning until the software develops, wait no more.

Crossposted with churchkey.org

Posted in Community, Documentation

Design-Complete Prototypes

After many months of design and iterative prototyping and at the cost of a small amount of spilled blood, we are happy to announce that we have a final design for the Book Liberator. Take a look:

   
   

This overall design is not much different from our early builds, but it includes many small improvements that make the device operate more smoothly and reliably. You’ll also notice that the quality of the carpentry has improved.

We’re now in the thick of scaling up a manufacturing process. In other words, you will actually be able to buy these soon. Our current target timeframe is to have them up for sale this fall! We haven’t nailed down pricing yet, but we hope to hit $350, with cameras.

Posted in Uncategorized

Book Liberator in Forbes

Ian and I demoed our design-complete prototype for Forbes, and they did a good writeup on the device. This will help get the word out. Tell your friends, warn your enemies: Book Liberator is coming, and it will scan your books!

Posted in Uncategorized

Book Liberator at HOPE

Book Liberator cadged some table space at HOPE from our sponsor, Question Copyright. We met lots and lots of awesome hackers, and discovered they all love the Book Liberator. We started a lot of good and useful conversations this weekend about everything from manufacturing to remote shutter trigger to lighting options. We’ll be in touch with many of you to continue those discussions in the coming weeks.

Karl and Ian and I are excited about the interest shown in the project and enjoyed the chance to show off our design-complete prototype.

Thanks are owed to Barry, Clyde, and Gordon for tabling with us, and to Nina Paley for some fast and beautiful work toward a Book Liberator logo!

Posted in Uncategorized

BitTorrent and Miro, a better Distributed Proofreading

If you spend some time in the ebook community you inevitably run into Distributed Proofreaders, the collaborative proofreading group that supplies Project Gutenberg with high-quality text versions of public domain books. They are a small community of dedicated editors doing good work. Unfortunately, they are also becoming irrelevant to most of the issues in the field because their multi-layer workflow is simply too slow. When organizations like Google are releasing a million books at once, it is hard to stay relevant while struggling to complete your project’s 20,000th book, even if those books, unlike Google’s, are meticulously verified and formatted. Scale and quality both matter and, if we structure it right, we can rework our communal digitization projects to get both.

Currently, Distributed Proofreaders only releases books after spending weeks or months verifying that the text version matches the original page images. Industrial scanning efforts like Google Books and the Million Book Project generally skip verification entirely and distribute raw text versions with the photographic page images. This is perhaps the biggest reason for their size. Yes, they also paid for large-scale scanning, but scanning is easy compared to proofreading, and getting easier all the time. You can be sure that Google’s library would not be half so large if they had to pay for the kind of quality that Distributed Proofreaders provides. Unfortunately, if the price of this quality is having only thousands rather than millions of books, it is too high to continue paying.

I propose a middle road between the raw image release and the meticulous text one. What if we distributed raw image and unverified text files from day one, but built our distribution network so that everyone downloading a copy could upload corrections and share those corrections automatically with everyone else who has a copy? If we did that, we could gain speed and scale while also building our community of contributors.

Technologically, BitTorrent and a rich client like Miro would get us most of the way there. We would make each book into a Miro channel that people would subscribe to when downloading the book. Once the book is downloaded, we would need a book reading view that we could optimize for whatever common reader actions relate to proofreading. Things like spell check and revealing the text around a section to verify academic citations spring immediately to mind. The key is that corrections should come primarily from people’s normal interactions with the books they are interested in, no altruism or active volunteering necessary. Once people have corrected their local copies, the client sends those corrections back to the central server, where they can be sent out via RSS to everyone subscribed to that book’s channel.
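To make the idea concrete, here is a rough sketch of what a single correction might look like and how a client could merge a batch of them into its local copy. Nothing here is an existing Miro or BitTorrent feature; the `Correction` record, its fields, and `apply_corrections` are all hypothetical, and a real system would also need to decide whose corrections to trust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One reader-submitted fix to the OCR text of a single page (hypothetical format)."""
    page: int    # page number within the book
    offset: int  # character position within that page's text
    old: str     # the text the reader saw, used to detect stale corrections
    new: str     # the text the reader says is printed on the page


def apply_corrections(pages: dict[int, str], corrections: list[Correction]) -> list[Correction]:
    """Apply corrections to the local page texts in place; return those that no longer match."""
    rejected = []
    # Work backwards through each page so earlier offsets stay valid after each edit.
    for c in sorted(corrections, key=lambda c: (c.page, -c.offset)):
        text = pages.get(c.page, "")
        if text[c.offset:c.offset + len(c.old)] != c.old:
            rejected.append(c)  # the local text at this spot is not what the reader saw
            continue
        pages[c.page] = text[:c.offset] + c.new + text[c.offset + len(c.old):]
    return rejected
```

Recording the old text alongside the new means a client can safely skip a correction that someone else has already made, which matters when many readers are fixing the same book at once.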

As far as the user is concerned, she simply downloads the books she is interested in with her Miro-based library manager and either fixes errors as they bother her, or leaves them alone and watches the text gradually correct itself as other people interested in the same books notice and correct errors. If the errors are really frustrating, she can always fall back to reading the page images and be no worse off than if reading on Google Books or any other large page image-based digital library.

As far as the community is concerned, we get a larger pool of potential contributors because now everyone with a copy can contribute back, and people are able to contribute by sharing spare hard drive space and unused bandwidth rather than having to donate funds to pay for central hosting and distribution. There are plenty of people in the community who have no time or inclination to proofread but would gladly download some book images and leave a torrent running in the background to help share the files more widely.

Making it easier to contribute increases the effectiveness of the project as a whole by helping make sure that all the people who care about a book have the opportunity to put their time into preserving that book. The more people care, the more work gets done. In two years of talking with people about my own book digitization projects, I have grown to have a healthy respect for how much people care about their own books and about preserving them, in whatever form.

In the end, there are only two scalable digitization strategies: teach computers to read, or harness the passion people have for their books for the benefit of us all. A handful of highly organized editors like the Distributed Proofreaders community will always have their place, but they cannot handle the scale of this project alone. We should make sure they have some help.

(Crossposted with churchkey.org)

Posted in Community

Book Liberator in the News

The Book Liberator got a writeup in Good magazine! I sent in hundreds of rambling words about the project, and Theo distilled them into a few pithy quotes. Thanks, Theo, for making me seem clever!

Posted in Uncategorized

Prototypes ahoy!

Last week, Ian and Winnie got all heroic with some tools, wood and plexi. The result is a couple sweet prototypes, which we’ll be sending to the Decapod folks so they can hack software to process BookLib images.

In other news, I put a prototype design of the camera mount on Thingiverse. Ian’s original washer-and-bolt design was a little janky, and once we get the parameters right on the mount, we should be able to print them quite cheaply.

We’re moving quite quickly towards a shippable kit. The cradle design is stable. We have dimensions for the plexi and the cube. We’re down to exploring two basic design paths (bent plexi vs. two flat sheets). Everything else about the prototype is in the detail stage.

There are photos of the wood hackery around, and I’ll try to post some soon.

Posted in Uncategorized

The people’s words

One of the biggest problems for groups like Project Gutenberg that want to digitize and share our culture’s public domain works is tracking down and confirming that a work is no longer under copyright. Gutenberg is not alone: towards the end of last month I ran into an opinion piece on TeleRead arguing that Amazon is right to keep away from public domain books for this same reason.

In the United States we have a resource with authoritative records about which works are covered by copyright and which ones are in the public domain: the Library of Congress. Not only does the Library of Congress have authoritative records but, as the largest library in the world, it has physical copies of more works than any other institution. Unfortunately, the Library of Congress has no plans to digitize its collection. For those of us involved with book digitization, this is something of a sore topic.

So it was a great moment for me this morning to read that the Japanese National Diet Library, a close equivalent of our Library of Congress, is digitizing all of its out-of-copyright works. Not only is the Diet Library digitizing and distributing the out-of-copyright works, it is also beginning a process of digitizing the portions of its collection still under copyright in order to preserve those works more easily against physical destruction.

Of course, if preservation is our goal, the true solution is obvious and has been known in this country since its founding:

“[T]he lost cannot be recovered; but let us save what remains: not by vaults and locks which fence them from the public eye and use, in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.”
-Thomas Jefferson
(Boyd, ‘These Precious Monuments of…Our History,’ pp.175-6)

Whatever the reason, it is great to see leading institutions take steps to share the public domain with the public.

Posted in In the News

Pushing Paper

Great piece up today by Paul Graham called Post-Medium Publishing:

Almost every form of publishing has been organized as if the medium was what they were selling, and the content was irrelevant. Book publishers, for example, set prices based on the cost of producing and distributing books. They treat the words printed in the book the same way a textile manufacturer treats the patterns printed on its fabrics.

Economically, the print media are in the business of marking up paper. We can all imagine an old-style editor getting a scoop and saying “this will sell a lot of papers!” Cross out that final S and you’re describing their business model. The reason they make less money now is that people don’t need as much paper.

Your “content” is ripe for unpulping.

Posted in Uncategorized

A Bit about Book Ripping

Digitizing your own books

The Book Ripper community, bkrpr.org, came together to take the difficulty out of digitizing books. Unlike music, movies, or even loose paper, books have proved surprisingly difficult to break out of their analog format. Very complicated robotic scanners, costing tens of thousands of dollars, have been built to address this problem, but their size and cost make them practical only for large institutions, leaving individuals who want digital books at the mercy of book publishers.

As it turns out, digitizing your books is not hard. The advances in small cameras make it possible to achieve high quality results cheaply and at a rate of 600-900 pages per hour. That is what we do at bkrpr.org and there are a number of advantages compared to getting your ebooks from publishers.

Cost

The most impressive advantage is cost. For people who own books already, getting digital copies of those books from publishers is an expensive prospect. Commercial ebooks have no commodity price and can vary wildly by publishing outlet, but let’s assume a $10 price for each ebook. The book ripper design we use costs around $250, which includes the price of two small point-and-shoot cameras. If you own more than 25 books, building a scanner will be cheaper than buying electronic editions. For those of us who own hundreds or thousands of books, the math becomes obvious.
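The arithmetic is simple enough to check for your own shelf. This is only a back-of-the-envelope sketch using the assumed figures above ($250 for the scanner, $10 per ebook); it ignores the value of your time, and the function names are mine.

```python
import math


def break_even_books(scanner_cost: float, ebook_price: float) -> int:
    """Smallest number of books at which scanning is strictly cheaper than buying ebooks."""
    return math.floor(scanner_cost / ebook_price) + 1


def savings(books: int, scanner_cost: float, ebook_price: float) -> float:
    """Money saved by scanning a collection instead of re-buying it as ebooks."""
    return books * ebook_price - scanner_cost


print(break_even_books(250, 10))  # 26: the first book past the 25-book tie
print(savings(500, 250, 10))      # 4750: savings on a 500-book library
```

At exactly 25 books the two options cost the same, which is why the scanner only wins once you own more than 25.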

Control

In the wake of Amazon’s memory hole-ing of George Orwell’s works, their retroactive disabling of the text-to-speech capabilities on new readers, and the continuing industry-wide obsession with DRM, control over your ebooks has been gaining visibility as an issue in the digitization of our vast printed catalogue. With publisher-made ebooks, the publisher controls which devices can read them, what software can do to them, where they can be stored, how many times you can download them, and how long you have access to them. People doubt so strongly that the closed formats publishers sell books in will remain readable that they suggest insurance as a way to cover your losses when your digital copies disappear.

The books that you convert, you control.

Authority

Of course, the illegal distribution channels release everything in free formats, and release it all for free, so they would seem to be ahead of publishers on both fronts, and with much less effort than home book ripping. Where the illegal copy market falls short, besides the obvious issues of copyright infringement, is in the reliability of its versions.

Illegal copies are known for typos and OCR errors, lack of text and page formatting, and spotty availability of works. Unfortunately, legal ebooks are known for these same things. Neither can be relied upon as an authoritative representation of the author’s work, and neither offers any way to verify or improve the accuracy of the digital work other than by reference to the printed one.

In contrast, when you scan the books yourself, you retain high quality images of every page. Viewers and other tools will let you jump back and forth from the text to the image versions. OCR can be corrected over time or re-run with better software and formatting can be added or corrected, but only if you have the page images.

Until digital distribution becomes the original and authoritative method of book publishing, as it has for the web, having the page images will remain the only way to guarantee or improve the accuracy of your digital books.

Because you love your books

If you love your books, if you care enough about them that you need every word to be right and you want the digital copy to be as beautiful as the paper one, you should scan them yourself. If you don’t care that much about a book, the publishers’ copy or the illegal copy may be all you need, or you might be better off cutting the spines off your existing books and feeding them through a high speed USB scanner. You can always recycle the pages afterwards.

If nothing but the best will do, or no other options are available, come on over to bkrpr.org and see how easy it is.

Posted in Uncategorized