XPDF Filter
This is an alternative suite of MediaFilter plugins that offers faster and more reliable text extraction from PDF Bitstreams, as well as thumbnail image generation. It replaces the built-in default PDF MediaFilter.
If this filter is so much better, why isn't it the default? The answer is that it relies on external executable programs which must be obtained and installed for your server platform. This would add too much complexity to the installation process, so it is left out as an optional "extra" step.
Installation Overview
Here are the steps required to install and configure the filters:
- Install the xpdf tools for your platform, from the downloads at http://www.foolabs.com/xpdf
- Acquire the Sun Java Advanced Imaging Tools and create a local Maven package.
- Edit the DSpace configuration properties to add the location of the xpdf executables and reconfigure the MediaFilter plugins.
- Build and install DSpace, adding -Pxpdf-mediafilter-support to the Maven invocation.
Install XPDF Tools
First, download the XPDF suite found at: http://www.foolabs.com/xpdf and install it on your server. The executables can be located anywhere, but make a note of the full path to each command.
You may be able to download a binary distribution for your platform, which simplifies installation. Xpdf is readily available for Linux, Solaris, MacOSX, Windows, NetBSD, HP-UX, AIX, and OpenVMS, and is reported to work on OS/2 and many other systems.
The only tools you really need are:
- pdfinfo - displays properties and Info dict
- pdftotext - extracts text from PDF
- pdftoppm - images PDF for thumbnails
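Before moving on, it can help to confirm that all three tools are reachable and to record their full paths for the configuration step later. The following is a minimal sketch, not part of DSpace; the function name locate_xpdf_tools is hypothetical, and it assumes the tools were installed somewhere on the current PATH.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DSpace or xpdf).
# Prints "name=/full/path" for each required xpdf tool found on PATH.
# Returns non-zero if any of the three tools cannot be found.
locate_xpdf_tools() {
    rc=0
    for tool in pdfinfo pdftotext pdftoppm; do
        if tool_path=$(command -v "$tool"); then
            printf '%s=%s\n' "$tool" "$tool_path"
        else
            printf '%s=MISSING\n' "$tool" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Calling locate_xpdf_tools after sourcing the snippet lists one line per tool; anything reported as MISSING on standard error still needs to be installed or added to PATH.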
Fetch and install jai_imageio JAR
Fetch and install the Java Advanced Imaging Image I/O Tools.
For AIX, Sun support says the following: "JAI has native acceleration for the above but it also works in pure Java mode. So as long as you have an appropriate JDK for AIX (1.3 or later, I believe), you should be able to use it. You can download any of them, extract just the jars, and put those in your $CLASSPATH."
Download the jai_imageio library version 1.0_01 or 1.1 found at: https://jai-imageio.dev.java.net/binary-builds.html#Stable_builds .
For these filters you do NOT have to worry about the native code, just the JAR, so choose a download for any platform.
    curl -O http://download.java.net/media/jai-imageio/builds/release/1.1/jai_imageio-1_1-lib-linux-i586.tar.gz
    tar xzf jai_imageio-1_1-lib-linux-i586.tar.gz
The preceding example leaves the JAR in jai_imageio-1_1/lib/jai_imageio.jar . Now install it in your local Maven repository, changing the path after -Dfile= if necessary, e.g.:
    mvn install:install-file \
        -Dfile=jai_imageio-1_1/lib/jai_imageio.jar \
        -DgroupId=com.sun.media \
        -DartifactId=jai_imageio \
        -Dversion=1.0_01 \
        -Dpackaging=jar \
        -DgeneratePom=true
You may have to repeat this procedure for the jai_core.jar library, as well, if it is not available in any of the public Maven repositories. Once acquired, this command installs it locally:
    mvn install:install-file -Dfile=jai_core-1.1.2_01.jar \
        -DgroupId=javax.media -DartifactId=jai_core -Dversion=1.1.2_01 \
        -Dpackaging=jar -DgeneratePom=true
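To confirm that both install commands worked, you can look for the JARs in the local Maven repository. The sketch below assumes the default repository location ($HOME/.m2/repository, which a custom settings.xml may override) and the standard layout of groupId directories, then artifactId, then version. The function names are hypothetical helpers, not Maven commands.

```shell
#!/bin/sh
# Hypothetical helpers; assume the default local repository layout:
#   <repo>/<groupId with dots as slashes>/<artifactId>/<version>/<artifactId>-<version>.jar

# usage: maven_jar_path GROUP_ID ARTIFACT_ID VERSION [REPO_DIR]
maven_jar_path() {
    repo=${4:-"$HOME/.m2/repository"}
    group_dir=$(printf '%s' "$1" | tr '.' '/')
    printf '%s/%s/%s/%s/%s-%s.jar\n' "$repo" "$group_dir" "$2" "$3" "$2" "$3"
}

# usage: check_jai_artifacts [REPO_DIR]
# Checks the two artifacts installed by the commands above, using the
# same coordinates. Returns non-zero if either JAR is absent.
check_jai_artifacts() {
    repo_dir=${1:-"$HOME/.m2/repository"}
    rc=0
    while read -r group artifact version; do
        jar=$(maven_jar_path "$group" "$artifact" "$version" "$repo_dir")
        if [ -f "$jar" ]; then
            echo "found    $jar"
        else
            echo "missing  $jar" >&2
            rc=1
        fi
    done <<EOF
com.sun.media jai_imageio 1.0_01
javax.media jai_core 1.1.2_01
EOF
    return "$rc"
}
```

For example, maven_jar_path com.sun.media jai_imageio 1.0_01 prints $HOME/.m2/repository/com/sun/media/jai_imageio/1.0_01/jai_imageio-1.0_01.jar with $HOME expanded.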
Edit DSpace Configuration
First, be sure there is a value for thumbnail.maxwidth and that it corresponds to the size you want for preview images for the UI, e.g.: (NOTE: this code doesn't pay any attention to thumbnail.maxheight but it's best to set it too so the other thumbnail filters make square images.)
    # maximum width and height of generated thumbnails
    thumbnail.maxwidth = 80
    thumbnail.maxheight = 80
Now, add the absolute paths to the XPDF tools you installed. In this example they are installed under /usr/local/bin (a logical place on Linux and MacOSX), but they may be anywhere.
    xpdf.path.pdftotext = /usr/local/bin/pdftotext
    xpdf.path.pdftoppm = /usr/local/bin/pdftoppm
    xpdf.path.pdfinfo = /usr/local/bin/pdfinfo
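A mistyped path here is easy to miss until the filters run, so it may be worth checking the configured values up front. The following sketch reads every xpdf.path.* property from a configuration file and tests that each value names an executable file. The function name check_xpdf_paths is hypothetical, and the sketch assumes each property sits on a single line with no backslash continuation.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSpace).
# usage: check_xpdf_paths /path/to/dspace.cfg
# Returns non-zero if no xpdf.path.* property is set or if any
# configured path is not an executable file.
check_xpdf_paths() {
    # Reduce each "xpdf.path.NAME = VALUE" line to "xpdf.path.NAME VALUE".
    pairs=$(sed -n 's/^[[:space:]]*\(xpdf\.path\.[a-z]*\)[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1 \2/p' "$1")
    if [ -z "$pairs" ]; then
        echo "no xpdf.path.* properties found in $1" >&2
        return 1
    fi
    rc=0
    while read -r key exe; do
        if [ -x "$exe" ] && [ ! -d "$exe" ]; then
            echo "ok       $key = $exe"
        else
            echo "MISSING  $key = $exe" >&2
            rc=1
        fi
    done <<EOF
$pairs
EOF
    return "$rc"
}
```

Run it against your own configuration file, for instance check_xpdf_paths [dspace]/config/dspace.cfg with [dspace] replaced by your installation directory.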
Change the MediaFilter plugin configuration to remove the old org.dspace.app.mediafilter.PDFFilter and add the new filters, e.g.: (The new entries are those naming XPDF2Text and XPDF2Thumbnail.)
    filter.plugins = \
        PDF Text Extractor, \
        PDF Thumbnail, \
        HTML Text Extractor, \
        Word Text Extractor, \
        JPEG Thumbnail

    plugin.named.org.dspace.app.mediafilter.FormatFilter = \
        org.dspace.app.mediafilter.XPDF2Text = PDF Text Extractor, \
        org.dspace.app.mediafilter.XPDF2Thumbnail = PDF Thumbnail, \
        org.dspace.app.mediafilter.HTMLFilter = HTML Text Extractor, \
        org.dspace.app.mediafilter.WordFilter = Word Text Extractor, \
        org.dspace.app.mediafilter.JPEGFilter = JPEG Thumbnail, \
        org.dspace.app.mediafilter.BrandedPreviewJPEGFilter = Branded Preview JPEG
Then add the input format configuration properties for each of the new filters, e.g.:
    filter.org.dspace.app.mediafilter.XPDF2Thumbnail.inputFormats = Adobe PDF
    filter.org.dspace.app.mediafilter.XPDF2Text.inputFormats = Adobe PDF
Finally, if you want PDF thumbnail images, don't forget to add that filter name to the filter.plugins property, e.g.:
    filter.plugins = PDF Thumbnail, PDF Text Extractor, ...
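Because filter.plugins is often split across several lines with backslash continuations, a plain grep can miss an entry. The sketch below joins continued lines before looking up a property, then tests whether a given filter name is one of the comma-separated entries. Both function names are hypothetical, and the parsing covers only simple key = value lines, not the full Java properties syntax.

```shell
#!/bin/sh
# Hypothetical helpers (not part of DSpace).

# usage: get_prop KEY FILE
# Prints the value of KEY, joining lines that end in a backslash.
get_prop() {
    awk -v key="$1" '
        {
            cur = cur $0
            if (cur ~ /\\$/) {           # continuation: drop the "\" and read on
                sub(/\\$/, "", cur)
                next
            }
            eq = index(cur, "=")
            name = substr(cur, 1, eq - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (eq > 0 && name == key)
                print substr(cur, eq + 1)
            cur = ""
        }' "$2"
}

# usage: plugin_enabled NAME FILE
# Succeeds only if NAME is an exact entry in the filter.plugins list.
plugin_enabled() {
    get_prop filter.plugins "$2" | tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -Fxq "$1"
}
```

For instance, plugin_enabled "PDF Thumbnail" followed by the path to your dspace.cfg exits with status 0 when the thumbnail filter is listed and non-zero otherwise.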
Build and Install
Follow your usual DSpace installation/update procedure, but add -Pxpdf-mediafilter-support to the Maven invocation:
    mvn -Pxpdf-mediafilter-support package
    ant -Dconfig=[dspace]/config/dspace.cfg update