
I’m trying to build a simple PowerShell cmdlet that uses iText7 to open a PDF file and output its text. I’ve tried the iText7Module from the PowerShell Gallery, but the LocationTextExtractionStrategy that ships with iText7 is too strict about vertical position. I’d like to implement the approach described in "iText7 reading out lines in a wrong order" to address that. The only functionality I need from iText7 is text extraction, so I figured it would be best to build a custom cmdlet that does just that one thing (using the lax extraction strategy).

I’ve been building the cmdlet incrementally. Starting from a "Hello, World" cmdlet, I’ve been adding the needed iText7 calls one at a time. It worked until I tried to create the PdfDocument object. When I run the cmdlet, I get:

The type initializer for ‘iText.Commons.Actions.EventManager’ threw an exception.

I’m using Visual Studio Code. Apologies in advance: my C#/.NET dev skills are pretty limited (all the rest of the heavy lifting in this project will be done in PowerShell).

using System;
using System.Management.Automation; 
using iText.Kernel.Pdf;

namespace TestCmdlet{
    [Cmdlet(VerbsCommon.Get, "LaxPDFText")]

    public class LaxPDFText : PSCmdlet {
        [Parameter(Mandatory = true)]
        public string filePath {get; set;}

        [Parameter(Mandatory = true)]
        public int laxRange {get; set;}

        protected override void BeginProcessing()
        {
          WriteObject("filePath: " + filePath);
          WriteObject("laxRange: " + laxRange);
          try {
            PdfReader pdfReader = new PdfReader(filePath);
            try {
              PdfDocument pdfDocument = new PdfDocument(pdfReader);
              pdfDocument.Close();
              }
            catch (Exception e) {WriteObject("Couldn't create pdfDocument"); WriteObject(e.Message); WriteObject(e.StackTrace);}
            pdfReader.Close();
          }
          catch (Exception e) {WriteObject("Couldn't create pdfReader"); WriteObject(e.Message); WriteObject(e.StackTrace);}
        }
    }
}

Then attempting in PowerShell…

PS C:\Users\Van Drunens\documents> import-module TestCmdlet
PS C:\Users\Van Drunens\documents> Get-LaxPDFText -filePath "c:\users\van drunens\documents\test.pdf" -laxRange 5
filePath: c:\users\van drunens\documents\test.pdf
laxRange: 5
Couldn't create pdfDocument
The type initializer for 'iText.Commons.Actions.EventManager' threw an exception.
   at iText.Kernel.Pdf.PdfDocument.Open(PdfVersion newPdfVersion)
   at TestCmdlet.LaxPDFText.BeginProcessing()

I’m perplexed by the stack frame at iText.Kernel.Pdf.PdfDocument.Open(PdfVersion newPdfVersion), which references a PdfVersion type.

2 Answers


  1. Chosen as BEST ANSWER

    Printing the InnerException helped track down the problem assemblies. Some were missing, but the final problem is that the NuGet build of iText7 seems to have an incorrect reference:

    Could not load file or assembly 'System.Text.Encoding.CodePages, Version=4.0.2.0, ...

    Looking at NuGet, there is no such package version. It appears that some DLL in iText7 references it (even though, as far as I can see, the iText7 project files don't).

    I substituted the iText7 DLLs that ship with iText7Module on the PowerShell Gallery and targeted 7.2.0 (which is what they are labeled), and it now works. The iText7 DLLs from iText7Module labeled 7.2.0 aren't the same as the ones with the same version from NuGet. I'm not sure exactly what is going on, but this resolved the issue.
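
    For anyone hitting the same error: a TypeInitializationException only reports that a static initializer failed, and the real cause (here, an assembly load failure) sits in its InnerException chain. Below is a minimal sketch of a helper that prints the whole chain. The class name `ExceptionChain` and the method `Describe` are hypothetical names chosen for illustration; they are not part of iText7 or PowerShell.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical helper (not part of iText7): flattens an exception chain into text.
    public static class ExceptionChain
    {
        // Returns one line per exception, outermost first,
        // indenting each inner exception by two more spaces.
        public static string Describe(Exception e)
        {
            var sb = new StringBuilder();
            int depth = 0;
            for (Exception current = e; current != null; current = current.InnerException)
            {
                sb.Append(new string(' ', depth * 2))
                  .Append(current.GetType().FullName)
                  .Append(": ")
                  .AppendLine(current.Message);
                depth++;
            }
            return sb.ToString();
        }
    }
    ```

    In the cmdlet from the question, the catch block could then call `WriteObject(ExceptionChain.Describe(e));` instead of writing only `e.Message`, which should surface the name of the assembly that failed to load.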


  2. Building a custom PowerShell cmdlet for text extraction with iText7 is a reasonable approach, especially if you want to customize the extraction behavior. The EventManager error means a static initializer failed, which is usually related to iText7's dependencies or to the way the library is loaded in the PowerShell environment.
    Here are a few things you could try to resolve it:
    
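    **Verify .NET Runtime Compatibility**

    Make sure the project's target framework matches the runtime of the PowerShell host that loads the module. For example, `net8.0` requires PowerShell 7.4 or later, while Windows PowerShell 5.1 runs on .NET Framework.

        <TargetFramework>net8.0</TargetFramework>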
    
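    **Check NuGet Package Dependencies**

    Confirm that the iText7 package and all of its dependencies are restored and end up next to the cmdlet DLL. From the Package Manager Console (with the dotnet CLI, `dotnet add package itext7` is the equivalent):

        Install-Package itext7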
    
    **Test Text Extraction**

        using System;
        using System.Management.Automation;
        using iText.Kernel.Pdf;
        using iText.Kernel.Pdf.Canvas.Parser;

        namespace TestCmdlet {
            [Cmdlet(VerbsCommon.Get, "LaxPDFText")]
            public class LaxPDFText : PSCmdlet {
                [Parameter(Mandatory = true)]
                public string filePath { get; set; }

                protected override void BeginProcessing() {
                    WriteObject($"filePath: {filePath}");

                    try {
                        using (PdfReader pdfReader = new PdfReader(filePath))
                        using (PdfDocument pdfDocument = new PdfDocument(pdfReader)) {
                            WriteObject("PdfDocument created successfully.");
                            string text = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(1));
                            WriteObject("Extracted Text:");
                            WriteObject(text);
                        }
                    } catch (Exception e) {
                        WriteObject("Error extracting text from PDF.");
                        WriteObject($"Exception: {e.GetType().Name}, Message: {e.Message}");
                        WriteObject(e.StackTrace);
                    }
                }
            }
        }
    Or

    **Customize Text Extraction Strategy**

        using System.Collections.Generic;
        using System.Text;
        using iText.Kernel.Pdf.Canvas.Parser;
        using iText.Kernel.Pdf.Canvas.Parser.Data;
        using iText.Kernel.Pdf.Canvas.Parser.Listener;

        public class LaxTextExtractionStrategy : ITextExtractionStrategy {
            private readonly StringBuilder text = new StringBuilder();

            public void EventOccurred(IEventData data, EventType type) {
                // Customize text extraction logic as needed
                if (data is TextRenderInfo renderInfo) {
                    text.Append(renderInfo.GetText());
                    text.Append(" "); // Add space to prevent words from sticking together
                }
            }

            // Required by IEventListener in iText7; returning null subscribes to all event types
            public ICollection<EventType> GetSupportedEvents() {
                return null;
            }

            public string GetResultantText() {
                return text.ToString();
            }
        }
    
    You would use this class in your LaxPDFText cmdlet when extracting text:

        string text = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(1), new LaxTextExtractionStrategy());
    

    **Troubleshooting the EventManager Issue**

    Clear the local NuGet caches, then restore and rebuild the project:

        dotnet nuget locals all --clear
    