Feeding in Jpeg stream from pdf gives error

Dec 27, 2013 at 6:15 PM
I have a bunch of PDF files that are composed solely of scanned document JPEGs. I am trying to shrink the files by pulling out each JPEG and processing it with Magick.NET. I am using PDFSharp to extract all the JPEGs from each PDF into a stream object and then feeding the stream to Magick.NET.

The problem is that when I do it this way, IM does not recognize the data as a JPEG. Instead it tries to invoke a BLOB delegate and throws an exception.

I am sure that the stream contains a JPEG because I can save the stream to a file and open it as an image. I can even read that file back into IM without issue. I have even tried detecting the MIME type of the stream with urlmon.dll before feeding it into IM, and it confirms that the data is "image/pjpeg". I just can't instantiate a MagickImage from the stream!

What am I doing wrong here?
Dec 27, 2013 at 7:11 PM
I am not sure why IM is not detecting your stream as a jpeg. You can help IM by specifying the format before you load your image. You can do that with the MagickReadSettings class:
MagickReadSettings settings = new MagickReadSettings();
settings.Format = MagickFormat.Pjpeg;
using (MagickImage image = new MagickImage(stream, settings))
{
    // Process the image here.
}
Dec 30, 2013 at 2:53 PM
I did some more testing and got it to work. It turns out that the problem occurred because I was taking the byte array that was extracted from the PDF and saving it to a MemoryStream object. IM threw the exception when the MagickImage object was created from that MemoryStream. But when I instantiated the MagickImage from the byte array directly, it worked fine. Weird! Anyway, we're good now.

On a related note, now that I have JPEGs working, is there any way for IM to open raster images saved with one of the other PDF filter types (FlateDecode, CCITTFaxDecode, JPXDecode, JBIG2Decode, etc.)?

Dec 30, 2013 at 7:33 PM
ImageMagick supports a lot of different file types. I don't know every type that is supported, so you will have to try it yourself. You could also consider installing Ghostscript (https://magick.codeplex.com/wikipage?title=Convert%20PDF&referringTitle=Documentation). With it installed, you can just read the PDF and Magick.NET will convert each page of your PDF to an image.

Did you reset the Position of your MemoryStream to the beginning? This might explain why reading from the MemoryStream did not work.
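To illustrate the Position issue, here is a minimal sketch that uses only System.IO and does not call Magick.NET. It assumes the MemoryStream was filled by writing the extracted bytes into it, which the thread does not confirm. The class and method names are hypothetical, and the four bytes are just a placeholder for the real JPEG data.

```csharp
using System;
using System.IO;

public static class StreamPositionDemo
{
    // Fills a new MemoryStream by writing to it. After the write,
    // Position sits at the end of the data, not at the beginning.
    public static MemoryStream WriteToStream(byte[] data)
    {
        var stream = new MemoryStream();
        stream.Write(data, 0, data.Length);
        return stream;
    }

    // Counts the bytes a reader would receive when it starts reading
    // at the stream's current Position. This consumes the stream.
    public static int CountReadableBytes(Stream stream)
    {
        var buffer = new byte[4096];
        int total = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            total += read;
        }
        return total;
    }

    public static void Main()
    {
        // Placeholder bytes standing in for the JPEG extracted from the PDF.
        byte[] jpegBytes = { 0xFF, 0xD8, 0xFF, 0xE0 };

        using (MemoryStream stream = WriteToStream(jpegBytes))
        {
            // Position is at the end, so a reader sees no data at all.
            Console.WriteLine(CountReadableBytes(stream)); // prints 0

            // Rewind before handing the stream to a reader.
            stream.Position = 0;
            Console.WriteLine(CountReadableBytes(stream)); // prints 4
        }
    }
}
```

A reader that receives the stream without the rewind gets zero bytes, so it has nothing to detect a format from. Note that a stream created with new MemoryStream(bytes) starts at Position 0 and does not need the rewind.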
Dec 30, 2013 at 8:17 PM
Edited Dec 30, 2013 at 8:17 PM
I reset the Position of the MemoryStream and it worked. Thanks!

I already tried using Ghostscript to save PDF pages as images, but the images it rasterizes come out at very low quality. That's why I want to read the PDF and pull the images directly from it. Any other suggestions would be very welcome! Thanks!
Dec 30, 2013 at 8:47 PM
Edited Dec 30, 2013 at 8:49 PM
You can improve the image quality by setting the Density property of MagickReadSettings. Because setting the Density changes the pixel size of the image, you will probably have to Resize the output image to a smaller size after you have read it.
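To show why the size changes, here is a rough arithmetic sketch that does not call Magick.NET. It assumes a US Letter page (612 x 792 points) and relies on ImageMagick's default density of 72 DPI. The class and method names are hypothetical, and 300 DPI is only an example value.

```csharp
using System;

public static class DensityDemo
{
    // A PDF point is 1/72 inch, so pixels = points / 72 * dpi.
    public static int PixelsAtDensity(double points, double dpi)
    {
        return (int)Math.Round(points / 72.0 * dpi);
    }

    // Percentage to resize by so that an image rendered at renderDpi
    // ends up with the pixel size it would have had at targetDpi.
    public static double ResizePercentage(double renderDpi, double targetDpi)
    {
        return targetDpi / renderDpi * 100.0;
    }

    public static void Main()
    {
        // Assumed page size: US Letter, 8.5 x 11 inches.
        const double widthPoints = 612;
        const double heightPoints = 792;

        // Default density of 72 DPI.
        Console.WriteLine(PixelsAtDensity(widthPoints, 72) + " x " +
                          PixelsAtDensity(heightPoints, 72));   // prints 612 x 792

        // Example density of 300 DPI.
        Console.WriteLine(PixelsAtDensity(widthPoints, 300) + " x " +
                          PixelsAtDensity(heightPoints, 300));  // prints 2550 x 3300

        // Rendering at 300 DPI and then resizing to the size of a 150 DPI render.
        Console.WriteLine(ResizePercentage(300, 150));          // prints 50
    }
}
```

Under these assumptions, a page read at the default density has only 612 x 792 pixels, which may be why the rasterized scans look poor. Reading at a higher density and then resizing down keeps more detail than reading at the default.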