
TextRipper (aka T-Rip)

Version 2.0

CLI other text tool

Score 56%

Downloads:  143
Submitted:  Sep 18 2010
Updated:  Jan 14 2011

Description:

An OCR (Optical Character Recognition) tool, usable as a GUI application or a CLI script
# Supports the Tesseract engine by default!
# Optionally supports the Ocrad engine for multi-column text.
# These recognition engines have a very high character recognition success rate compared to other OCR tools, including proprietary software.
# New: multi-page and multiple file selection support!
# Enhanced XSANE output and TIFF compatibility.
# New: now handles nearly any format out there!
# This script will convert any image of text into editable and indexable text. (for a full list of compatible file formats see the first filter below)
#
# Note: The cleaner, higher-contrast, and higher-resolution your image or scan is, the better the results.
#
# Dependencies: libtiff-dev (or -devel)(installed FIRST), tesseract-2.04 (latest stable-version), your chosen language data for Tesseract (2.00 and up) *1,
# ImageMagick, ghostscript, Zenity, and OpenOffice or other text editor *2
# This version of tesseract can be downloaded from here: http://code.google.com/p/tesseract-ocr/downloads/list
# Warning: This script will not work with the latest beta version (tesseract 3.00 pre-release) due to database structure modifications.
#
# Optional dependencies: ocrad ->an alternate recognition engine
# If initial results are unsatisfactory, maybe this engine will do better. Most importantly, it supports basic page format recognition. *3
# The latest version of ocrad can be downloaded off the GNU mirror list here: http://www.gnu.org/software/ocrad/
#
# Also: Make sure to select Unicode UTF-8 in OpenOffice's pop-up window (or text editor of your choice).
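
The dependency list above can be checked before a first run. This is only a rough sketch and not part of TextRipper itself: the function name `missing_deps` is hypothetical, and the command names (`convert` for ImageMagick, `gs` for ghostscript, and so on) are assumptions about how these packages appear on your PATH. A library such as libtiff-dev cannot be detected this way.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH, one per line.
missing_deps() {
    for cmd in "$@"; do
        # command -v is the POSIX way to test whether a command exists
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed command names; adjust them for your distribution.
missing=$(missing_deps tesseract convert gs zenity soffice)
if [ -n "$missing" ]; then
    echo "Missing commands:" $missing
fi
```

Adding `ocrad` to the list would also report whether the optional engine is installed.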
#
#
#
# *1 Install Tesseract after libtiff-dev. Then extract all the language databases you need into the "wherever_you_installed/tesseract-2.04/tessdata" directory.
# This is done automatically if you extract the language databases from WITHIN the "tesseract-2.04" directory (and allow overwriting).
# This script allows the use of multiple language databases. Default is English and French. For adding others see comments below.
# You NEED at least one language database or tesseract will not work.
# *2 Simply change the occurrence of "soffice -writer" below to a text editor of your choice, e.g. gedit, KWrite, etc.
# Some systems call OpenOffice Writer differently. If unsure, check the properties tab of your Writer launcher.
# E.g. on customized versions of OOo (such as the ones provided by Linux Mandrake or Gentoo), you start Writer with: oowriter
# *3 If you also install ocrad, TextRipper will detect it and prompt you to choose between the two engines: better recognition or page format support.
#
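
Note *2 can also be handled by probing for a launcher instead of editing the script by hand. The following is a minimal sketch under assumptions: `first_available` is a hypothetical helper, not something TextRipper provides, and the candidate order is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: print the first command from the argument list
# that exists on PATH; return non-zero if none of them does.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# Candidate launchers taken from the notes above; the order is an assumption.
editor=$(first_available soffice oowriter gedit kwrite) || editor=""
echo "Editor command: ${editor:-none found}"
```

If `soffice` is the one found, it still needs the `-writer` argument mentioned in note *2.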
# Troubleshooting:
# If this script ends by saying your text editor can't open "OCR output-editable text.txt",
# or, when run from the CLI, prints: Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
# then run (as superuser):
# echo /usr/local/share /usr/share | xargs -n 1 cp -R wherever_you_installed/tesseract-2.04/tessdata
# Explanation: Tesseract may call on the tessdata directory from the /share directory of your filesystem,
# so you need to make your language databases available from there.
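
Before copying anything, it may help to see which of the two share directories already holds the language data. This sketch assumes that the presence of `eng.unicharset` (the file named in the error message above) is a good enough indicator; `find_tessdata` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: given a language code and a list of tessdata
# directories, print the first directory that holds LANG.unicharset.
find_tessdata() {
    lang=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$lang.unicharset" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# The two locations named in the troubleshooting note above.
if dir=$(find_tessdata eng /usr/local/share/tessdata /usr/share/tessdata); then
    echo "English data found in $dir"
else
    echo "eng.unicharset not found; copy your tessdata directory as described above"
fi
```

Replace `eng` with another language code (for example `fra`) to check the other databases you extracted.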




License: GPL

 Bad recognition

 
 by YAFU on: Sep 19 2010
 
Score 50%

I have tried with an image that contains only text, with:
./script.sh image_name.png

After selecting UTF-8 in OpenOffice, I get bad recognition: unintelligible text and symbols. I am using the latest version of ocrad. Am I doing something wrong?
Thanks.



 Re: Bad recognition

 
 by kickass on: Sep 19 2010
 
Score 50%

If your text editor ends up showing something, anything, you're doing it right. That means the problem is the source file. Try using GIMP to save it in .pnm format. If this fails, here are a few pointers:
1) Latin characters, right? Not Korean or something like that.
2) Resolution? Contrast? The higher the better. Grayscale better than color.
3) Clean original, not smudged or spotted.
4) 1 column and 1 page. If you have say a 2 column format then use GIMP to crop to 2 separate pages and save as .pnm.
5) I'm converting non-pnm class images with ImageMagick. Do you have that installed?
If you want/can, you may send me your image and I'll see what I can do with it.
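
Points 2 and 5 above can also be done from the command line rather than in GIMP. This is a rough sketch, not the script's actual conversion step: it assumes ImageMagick's `convert` command is installed, and both function names are hypothetical. It only converts to grayscale PNM; it cannot add resolution that the original lacks, and multi-column pages still need cropping by hand.

```shell
#!/bin/sh
# Hypothetical helper: derive the .pnm output name from an input file name
# by replacing the last extension (scan.jpg -> scan.pnm).
pnm_name() {
    case "${1##*/}" in
        *.*) printf '%s.pnm\n' "${1%.*}" ;;   # has an extension: swap it
        *)   printf '%s.pnm\n' "$1" ;;        # no extension: append one
    esac
}

# Convert one image to a grayscale PNM with ImageMagick's convert
# and print the name of the file that was written.
to_gray_pnm() {
    out=$(pnm_name "$1")
    convert "$1" -colorspace Gray "$out" && printf '%s\n' "$out"
}

# Usage (requires ImageMagick): to_gray_pnm image_name.png
```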



 Re: Re: Bad recognition

 
 by YAFU on: Sep 19 2010
 
Score 50%

Hello.
I have installed ImageMagick and zenity.
I have used this screen capture:
http://img819.imageshack.us/img819/6227/testvo.jpg

With both OpenOffice and Kate I get similar results. I use Latin characters; my locale is es_AR.
I have tested with the same image in png and pnm too.



 Re: Bad recognition

 
 by kickass on: Sep 20 2010
 
Score 50%

Hello YAFU,
Okay, the source file has a very low resolution. You will notice this when zooming in (either Ctrl+scroll or in GIMP).
I did, however, improve the app.
Ver. 1.1 will give you a workable output even from this low-res image. Sorry about wasting your time with installing a new recognition engine, but the results are even better. The thing is, this app is proving very successful and I'm being swarmed with emails, many of which contain useful comments.
d.



 Re: Re: Bad recognition

 
 by YAFU on: Sep 21 2010
 
Score 50%

Much better with this engine. No waste of time for me; I like to try new programs or scripts. Thanks for the time you have taken in developing the script.
I also believe that on Linux we are far from the recognition capabilities of other software such as Acrobat. If you have some knowledge and are interested, you can collaborate with the project "gscan2pdf".
Regards.



 Re: Ver. 2.0

 
 by kickass on: Sep 21 2010
 
Score 50%

Just to let you know, I'm working on doing the nearly impossible: extracting text from multi-page PDFs using my same fast and easy do-it-all script.
Thanks for your feedback. I'll check out the project.
d.
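
For readers wondering what such a multi-page step could look like, here is one possible approach, sketched under assumptions and not taken from TextRipper: ghostscript (already a listed dependency) renders each PDF page to a grayscale TIFF, and Tesseract is then run on each page. The function names, the 300 dpi setting, and the `eng` language choice are all illustrative.

```shell
#!/bin/sh
# Hypothetical helper: the per-page output pattern handed to ghostscript
# (book.pdf -> book-%03d.tif, which gs expands to book-001.tif, ...).
page_pattern() {
    printf '%s-%%03d.tif\n' "${1%.pdf}"
}

# Render every page of a PDF to a grayscale TIFF at an assumed 300 dpi.
pdf_to_tiffs() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 \
       "-sOutputFile=$(page_pattern "$1")" "$1"
}

# Run Tesseract on each rendered page; text lands in book-001.txt, ...
ocr_pages() {
    for page in "${1%.pdf}"-*.tif; do
        [ -f "$page" ] || continue          # no pages were rendered
        tesseract "$page" "${page%.tif}" -l eng
    done
}

# Usage (requires ghostscript and tesseract):
#   pdf_to_tiffs book.pdf && ocr_pages book.pdf
```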



 Re: Re: Re: Bad recognition

 
 by kickass on: Dec 10 2010
 
Score 50%

Just to let you know that I uploaded TextRipper, the new improved version of Text Recognition. Hope you like it.
d.

