I’m looking for an open source document management system to index all kinds of files: text (pdf, doc…), images (jpg, png, bmp…) and videos (mov, mp4…),
and I stumbled upon Datafari.
It uses the Solr search engine and ManifoldCF to manage content repository connections, and it has a Tika connector to help search through metadata.
I installed it and I’m trying to configure it so that it finds images by metadata criteria, but with no luck so far.
I added a local repository containing an image with some metadata:
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Artist" content="tarzan"/>
<meta name="date" content="2015-03-28T09:47:45"/>
<meta name="Print flags information" content="0 1 0 0 0 0 0 0 0 2"/>
<meta name="Slices" content="zebre (0,0,500,500) 1 Slices"/>
<meta name="ICC Untagged Profile" content="1"/>
<meta name="Compression Type" content="Baseline"/>
<meta name="subject" content="legs"/>
<meta name="subject" content="mammal"/>
<meta name="Image Description" content="this kind of animal is hard to see behind bar"/>
<meta name="Thumbnail Compression" content="JPEG (old-style)"/>
<meta name="Print flags" content="0 0 0 0 0 0 0 0 1"/>
<meta name="By-line" content="tarzan"/>
<meta name="Number of Components" content="3"/>
<meta name="Component 2" content="Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert"/>
<meta name="Component 1" content="Y component: Quantization table 0, Sampling factors 1 horiz/1 vert"/>
<meta name="tiff:ResolutionUnit" content="Inch"/>
<meta name="Object Name" content="king of disguise"/>
<meta name="Seed number" content="1"/>
<meta name="X Resolution" content="72 dots per inch"/>
<meta name="IPTC-NAA record" content="160 bytes binary data"/>
<meta name="Unknown tag (0x043a)" content="[239 bytes]"/>
<meta name="Version Info" content="1 (Adobe Photoshop, Adobe Photoshop CS6) 1"/>
<meta name="Component 3" content="Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert"/>
<meta name="dc:title" content="king of disguise"/>
<meta name="modified" content="2015-03-28T09:47:45"/>
<meta name="Thumbnail Data" content="JpegRGB, 160x160, Decomp 76800 bytes, 1572865 bpp, 6513 bytes"/>
<meta name="tiff:BitsPerSample" content="8"/>
<meta name="Application Record Version" content="42432"/>
<meta name="Resolution Info" content="72.0x72.0 DPI"/>
<meta name="meta:author" content="tarzan"/>
<meta name="meta:creation-date" content="2015-03-28T09:47:45"/>
<meta name="Caption digest" content="[16 bytes]"/>
<meta name="Creation-Date" content="2015-03-28T09:47:45"/>
<meta name="resourceName" content="zebre.jpg"/>
<meta name="Orientation" content="Top, left side (Horizontal / normal)"/>
<meta name="tiff:Orientation" content="1"/>
<meta name="tiff:Software" content="Adobe Photoshop CS6 (Windows)"/>
<meta name="Thumbnail Offset" content="354 bytes"/>
<meta name="Color Transform" content="YCbCr"/>
<meta name="Global Angle" content="120"/>
<meta name="Author" content="tarzan"/>
<meta name="Exif Image Height" content="500 pixels"/>
<meta name="Software" content="Adobe Photoshop CS6 (Windows)"/>
<meta name="tiff:YResolution" content="72.0"/>
<meta name="Y Resolution" content="72 dots per inch"/>
<meta name="dc:description" content="this kind of animal is hard to see behind bars"/>
<meta name="Color transfer functions" content="[112 bytes]"/>
<meta name="Keywords" content="legs"/>
<meta name="Keywords" content="mammal"/>
<meta name="Data Precision" content="8 bits"/>
<meta name="Coded Character Set" content="%G"/>
<meta name="dc:creator" content="tarzan"/>
<meta name="tiff:ImageLength" content="500"/>
<meta name="description" content="this kind of animal is hard to see behind bars"/>
<meta name="JPEG quality" content="12 (Maximum), Standard format, 3 scans"/>
<meta name="dcterms:created" content="2015-03-28T09:47:45"/>
<meta name="dcterms:modified" content="2015-03-28T09:47:45"/>
<meta name="Last-Modified" content="2015-03-28T09:47:45"/>
<meta name="Last-Save-Date" content="2015-03-28T09:47:45"/>
<meta name="Thumbnail Length" content="6513 bytes"/>
<meta name="Color Space" content="Undefined"/>
<meta name="Credit" content="tarzan"/>
<meta name="Global Altitude" content="30"/>
<meta name="meta:save-date" content="2015-03-28T09:47:45"/>
<meta name="Country/Primary Location Name" content="kenya"/>
<meta name="Content-Length" content="93123"/>
<meta name="Content-Type" content="image/jpeg"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.jpeg.JpegParser"/>
<meta name="creator" content="tarzan"/>
<meta name="Color halftoning information" content="[72 bytes]"/>
<meta name="dc:subject" content="legs"/>
<meta name="dc:subject" content="mammal"/>
<meta name="tiff:XResolution" content="72.0"/>
<meta name="Date/Time" content="2015:03:28 09:47:45"/>
<meta name="Grid and guides information" content="[16 bytes]"/>
<meta name="Caption/Abstract" content="this kind of animal is hard to see behind bars"/>
<meta name="DCT Encode Version" content="1"/>
<meta name="Exif Image Width" content="500 pixels"/>
<meta name="Image Height" content="500 pixels"/>
<meta name="Pixel Aspect Ratio" content="1.0"/>
<meta name="Supplemental Category(s)" content="earthly creature"/>
<meta name="Image Width" content="500 pixels"/>
<meta name="Flags 0" content="64"/>
<meta name="Resolution Unit" content="Inch"/>
<meta name="Unknown tag (0x043b)" content="[557 bytes]"/>
<meta name="URL List" content="0"/>
<meta name="meta:keyword" content="legs"/>
<meta name="meta:keyword" content="mammal"/>
<meta name="Print Scale" content="Centered, Scale 1.0"/>
<meta name="tiff:ImageWidth" content="500"/>
<meta name="Flags 1" content="0"/>
<title>king of disguise</title>
</head>
<body/></html>
In the Solr schema.xml I added the fields I needed:
<fields>
...
<field name="subject" type="string" indexed="true" stored="true" multiValued="true" />
...
</fields>
Then I restarted the server.
In the ManifoldCF administration, in the job list, I added a Tika extractor transformation to the job.
The pipeline is: my repository -> Tika Extractor -> DatafariSolr
I tried the search in the Solr interface:
For q, I tried "subject:legs"
and there, in the Solr interface, I retrieved the data.
But in the Datafari search engine, I get no results.
The Datafari help did not get me far, and I looked into the ManifoldCF documentation with no more luck.
I would like a real example of this kind of metadata search.
What should be modified and/or tested so that the image shows up in the results?
Update after Olivier Tavard's answer:
Thank you for your help. This tool is really promising, though I still have problems configuring it:
I can’t find datafari/WebContent/js/search.js. Did you mean datafari/tomcat/webapps/Datafari/js/search.js?
I added what you suggested.
I also added the fields “description” and “creator”.
1 – In the Solr search:
– If I search for “animal” in q, I can retrieve my image (not with “animal”), which is already better than “description:animal”.
– But if I search for “legs”, I don’t retrieve anything. Is it because there are several <meta> “subject” entries? Is there a different way to search them?
– If I search for “tarzan” (from the creator field), I don’t retrieve anything either.
2 – In the Datafari UI search:
– The changes I made seem to have “broken” the search: when I search, the wheel keeps spinning forever. In the console I have:
GET "http://localhost:8080/Datafari/css/menu.css" 404
Synchronous XMLHttpRequest on the main thread is deprecated because of its negative impact on the end user's experience.
3 – I added another picture with other metadata for the same fields. In the Solr search, if I query for “jpg”, both appear (OK), but in the JSON response the extra fields don’t appear for the other image!
{
"responseHeader": {
"status": 0,
"QTime": 6,
"params": {
"indent": "true",
"q": "jpgn",
"_": "1427968093838",
"wt": "json"
}
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"last_modified": "2015-03-28T09:47:45Z",
"id": "file:/home/olivier/Bureau/datafari/images/zebre.jpg",
"url": "file:/home/olivier/Bureau/datafari/images/zebre.jpg",
"source": "file",
"extension": "jpg",
"language": "en",
"content_en": [
""
],
"title_en": [
"zebre.jpg"
],
"title": [
"zebre.jpg"
],
"_version_": 1496971642075611100,
"allow_token_share": [
"__nosecurity__"
],
"deny_token_document": [
"__nosecurity__"
],
"deny_token_share": [
"__nosecurity__"
],
"allow_token_document": [
"__nosecurity__"
]
},
{
"last_modified": "2015-03-29T15:45:23Z",
"subject": [
"Description Mots clé"
],
"id": "file:/home/olivier/Bureau/datafari/metadata/image1toto.jpg",
"creator": [
"Description, IPTC - Auteur: beta"
],
"description": [
"Description Description : gamma"
],
"url": "file:/home/olivier/Bureau/datafari/metadata/image1toto.jpg",
"source": "file",
"extension": "jpg",
"language": "en",
"content_en": [
""
],
"title_en": [
"image1toto.jpg"
],
"title": [
"image1toto.jpg"
],
"_version_": 1497001790322770000,
"allow_token_share": [
"__nosecurity__"
],
"deny_token_document": [
"__nosecurity__"
],
"deny_token_share": [
"__nosecurity__"
],
"allow_token_document": [
"__nosecurity__"
]
}
]
},
"highlighting": {
"file:/home/olivier/Bureau/datafari/images/imagejpg.jpg": {
"content_fr": [
""
],
"content_en": [
""
]
},
"file:/home/olivier/Bureau/datafari/images/zebre.jpg": {
"content_fr": [
""
],
"content_en": [
""
]
},
"file:/home/olivier/Bureau/datafari/metadata/image1toto.jpg": {
"content_fr": [
""
],
"content_en": [
""
]
}
},
"spellcheck": {
"suggestions": []
},
"capsuleSearchComponent": {}
}
I’m very confused.
Edit after Olivier Tavard's answer
Sorry for the late reply; I’m working on something urgent at the moment and couldn’t test or answer as I wished.
I followed your steps (very didactic, thanks), and somewhat managed to get the result in the client search 🙂
But:
1 – I had to use wildcards to find it in the Datafari GUI: for “a horse in disguise” I had to type ‘**horse*’, and not ‘horse’.
2 – How can I retrieve data for multiple fields (e.g. meta:keyword …)?
<meta name="meta:keyword" content="legs"/>
<meta name="meta:keyword" content="mammal"/>
3 – I did a “standard” install, but I get a 404 for localhost:8080/Datafari/css/menu.css; maybe that’s why I get the search wheel until I refresh the page.
2 Answers
Thanks for using Datafari. To display your field in the UI, you have to modify two files:
datafari/tomcat/webapps/Datafari/js/main.js
Change the line:
And add the field you want, in your example subject:
datafari/WebContent/js/search.js
Add the display of your field by adding the code:
doc.subject
where you want it to appear. For example, if you want to add it after the URL of the document:
If your problem is related to the search (searching for legs does not return any results), you have to change the Solr configuration in datafari/solr/solr_home/FileShare/conf/solrconfig.xml:
And add the field subject to qf (and pf if you want) in the list:
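As a rough sketch of that change, assuming the search handler is named /select and uses the edismax parser: the other field names in qf and pf (title_en, content_en) are placeholders borrowed from the JSON response in the question, not Datafari's real list. In practice you would keep the existing list and simply append subject.

```xml
<!-- Sketch only: handler name, defType and the existing field list are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- qf: fields searched by default; append the new field to the existing list -->
    <str name="qf">title_en content_en subject</str>
    <!-- pf: optional phrase boost on the same fields -->
    <str name="pf">title_en content_en subject</str>
  </lst>
</requestHandler>
```

Note that subject was declared with type string in the question, so only exact whole values such as legs would match; a tokenized text type would be needed for word-level matching.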
We are starting to put some documentation here, if you are interested.
OK, I will try to complete the answer.
I started from a vanilla installation of Datafari, downloaded from datafari.com, to reproduce the steps.
Let’s say we want to add a new metadata field to Solr, named meta:author in the source and authorname in Datafari.
Let’s go through each step needed to display the field in the Datafari UI and to make it searchable with Solr.
1) Edit solrconfig.xml
We want to map the original metadata of the source file, meta:author, to a new Solr field named authorname.
So we have to edit the Solr Cell request handler:
The correct syntax is meta_author (and not meta:author) because of the line
<str name="lowernames">true</str>
The documentation says : “lowernames=true|false – Map all field names to lowercase with underscores”
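The mapping described above could look roughly like the following. This is a sketch under assumptions: the handler name /update/extract and the uprefix value ignored_ are the usual Solr Cell defaults and may not match the Datafari configuration exactly. Only the fmap line is the actual addition.

```xml
<!-- Sketch only: handler name and uprefix value are assumptions about this install. -->
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- lowernames turns meta:author into meta_author before the mapping is applied -->
    <str name="lowernames">true</str>
    <!-- metadata with no matching schema field goes to the ignored_* dynamic field -->
    <str name="uprefix">ignored_</str>
    <!-- map the Tika metadata meta_author to the Solr field authorname -->
    <str name="fmap.meta_author">authorname</str>
  </lst>
</requestHandler>
```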
You can also see in the configuration that all the ignored metadata are stored in the dynamic field ignored. I suggest changing the configuration of that field in schema.xml from stored=false to stored=true, so you can see all the metadata found by Tika (and the correct syntax to use when mapping the fields into Solr).
For example :
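A minimal sketch of the schema.xml change, assuming the dynamic field is declared as ignored_* with a type named ignored, as in the stock Solr example schema (the Datafari declaration may differ):

```xml
<!-- Sketch only: the exact name and type of the dynamic field may differ in Datafari. -->
<!-- stored="true" makes the otherwise ignored Tika metadata visible in query results -->
<dynamicField name="ignored_*" type="ignored" multiValued="true" stored="true"/>
```

Documents have to be re-indexed before the stored values show up in the results.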
2) Edit schema.xml
We now want to add the new field to the Solr schema, so add the following line:
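Such a line might look like this. The field type text_general is an assumption; any tokenized text type already defined in your schema.xml would do.

```xml
<!-- Sketch only: the field type is an assumption; pick one defined in your schema.xml. -->
<field name="authorname" type="text_general" indexed="true" stored="true" multiValued="true"/>
```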
OK, so far we can launch the indexing with ManifoldCF, and the new field is present in Solr.
3) Add the new field to the search
Edit solrconfig.xml, in the select request handler add the field :
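In practice this means appending authorname to the qf parameter inside the handler's defaults. In the sketch below, title_en and content_en are placeholders for whatever the list already contains in your installation:

```xml
<!-- Sketch only: title_en and content_en stand in for the fields already listed. -->
<str name="qf">title_en content_en authorname</str>
```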
After the core reloads, we can now search and find the data of the new field.
4) Configure the Datafari UI
In datafari/tomcat/webapps/Datafari/js/main.js (source code) or Datafari/tomcat/webapps/Datafari/js/main.js (installed version)
Change the line:
And add the field you want, here authorname
The last step is to change the JavaScript file search.js:
datafari/WebContent/js/search.js (source code) or Datafari/tomcat/webapps/Datafari/js/search.js (installed version)
Add the display of your field by adding the code doc.subject where you want it to appear. For example, if you want to add it after the URL of the document:
(I made a mistake in my previous answer; it is correct now.)
And finally your Datafari UI should look like this, with the author field at the end of each document:
Tell me if you still have problems.
Best regards
Edit after user29296's further questions
Concerning wildcards
It depends on which field type you use and what type of search you need. Normally you do not have to put an additional wildcard before the word. You would need a ReversedWildcardFilterFactory if you want to search for any prefix with a supplied suffix.
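For illustration, a field type using ReversedWildcardFilterFactory could be declared as below. This follows the pattern of the stock Solr example schema; the type name text_general_rev and the tuning values are assumptions, not Datafari settings.

```xml
<!-- Sketch only: type name and parameter values are illustrative. -->
<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- also index each token reversed, so leading-wildcard queries can be answered efficiently -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A field would then have to use this type, and the documents be re-indexed, before queries such as *horse benefit from it.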
Retrieve data for multiple fields
I do not understand what your problem is in this case. Could you give me an example, please? If you change the configuration of the select search handler, you can add the fields you want to search to the qf section. So just add the meta_keyword field there, and the client will search that field too when performing a search.
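For the meta_keyword field to be searchable through qf, it also has to exist as an indexed field in schema.xml; otherwise it would presumably end up in the ignored metadata. Here is a hedged sketch, assuming lowernames is enabled (so meta:keyword arrives as meta_keyword) and that text_general is an available type:

```xml
<!-- Sketch only: the field type is an assumption. -->
<!-- multiValued="true" keeps every <meta name="meta:keyword"> occurrence (legs, mammal) as its own value -->
<field name="meta_keyword" type="text_general" indexed="true" stored="true" multiValued="true"/>
```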
Menu.css 404 error
This error does not have any impact on the application. The fix for this missing file will be included in the next update of Datafari.