Imagine that you feed a system a batch of PDFs whose relationship only you know (e.g. they are all dissertations, or news articles, or invoices). The system knows that the documents in the batch are connected, but does not know how they relate.
The system then scans these PDFs and suggests indexes, along with their respective values for each document.
Here’s an example: you feed the system all the invoices your company receives. The system processes these documents and suggests the indexes “Supplier”, “Invoice Cost” and “Due Date”. For each PDF, the system also extracts the value of each index.
So my question is: what kind of artificial intelligence system is best suited to this scenario? A neural network? A combination?
2 Answers
You could do this with a simple keyword search, if you know which keywords the machine should be looking for and the documents all follow the same format.
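To make the keyword idea concrete, here is a minimal sketch in Python. It assumes the text has already been extracted from each PDF, and the labels it searches for (`Supplier:`, `Total:`, `Due Date:`) as well as the sample invoice are hypothetical; real invoices would need patterns matched to their actual layout.

```python
import re

# Hypothetical keyword patterns, one per index; adjust them to the real invoice layout.
FIELD_PATTERNS = {
    "Supplier": re.compile(r"Supplier:\s*(.+)"),
    "Invoice Cost": re.compile(r"Total:\s*([\d.,]+)"),
    "Due Date": re.compile(r"Due Date:\s*(\d{4}-\d{2}-\d{2})"),
}

def extract_fields(text):
    """Return a dict of index name -> extracted value (None if the keyword is absent)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result

# Made-up invoice text, standing in for the text of one PDF.
sample = """Supplier: Acme Ltd
Total: 1,250.00
Due Date: 2024-03-31
"""
fields = extract_fields(sample)
print(fields)
```

This only works while every document uses the same labels; as soon as one supplier writes "Amount payable" instead of "Total", the pattern for that index returns `None`.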
If the formats are non-uniform within each category, however, then you would need some kind of language processing so that the machine can understand what’s going on.
Try doing some research into natural language processing; this is probably along the lines of what you’re looking for:
NLP Wiki
You are looking for unsupervised learning algorithms. More specifically, yours is a clustering problem, since your system does not know anything about the data it is going to analyze and it has to come up with a correct classification of the documents (or their properties).
In your example, by using clustering algorithms, your system can learn to distinguish the documents you provide and to extract fields such as “Invoice”, “Supplier” …
The wiki page I linked should be enough to get a general idea of the class of algorithms you need. On Google you will find a plethora of lecture slides on the topic.
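As a rough illustration of the clustering idea, the sketch below groups documents by word overlap without being told what the groups are. It is not a specific algorithm from the linked page: it is a simple greedy, threshold-based clustering on bag-of-words cosine similarity, chosen because it fits in a few lines. The sample texts, the function names and the `threshold` value are all assumptions, and the sketch only groups documents; it does not extract field values.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def cluster(texts, threshold=0.3):
    """Greedy clustering: a document joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    vectors = [bag_of_words(t) for t in texts]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Made-up document texts: two invoice-like, two news-like.
docs = [
    "invoice supplier acme total due date",
    "invoice supplier globex total due date",
    "breaking news election results announced today",
    "news today storm results in flooding",
]
groups = cluster(docs)
print(groups)
```

For real documents you would probably want TF-IDF weighting and an established algorithm such as k-means or hierarchical clustering, but the principle is the same: no labels go in, and the groups come out of the similarity between documents.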