Option to disable metadata extraction from PDF files for search
OpenCms extracts text from documents in various formats to make them searchable. For PDF, RTF, and MS Office documents, metadata (e.g., author, title, keywords) is also extracted by default.
Some of this metadata is not maintained deliberately but is set automatically by the authoring software. It may also contain information that should not be searchable at all. For example, the author of a document is usually irrelevant for search.
Metadata extraction can be disabled per document type in the global configuration file opencms-search.xml by setting the parameter extract.metadata to false. The following example shows the configuration for PDF documents:
<!-- ... -->
<documenttype>
    <name>pdf</name>
    <class>org.opencms.search.documents.CmsDocumentPdf</class>
    <param name="extract.metadata">false</param>
    <mimetypes>
        <mimetype>application/pdf</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>binary</resourcetype>
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
<!-- ... -->
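Because the parameter is set per document type, the same line can in principle be added to other document types that extract metadata. The following is a minimal sketch for RTF documents. The class name and MIME types shown here are assumptions based on a typical opencms-search.xml and should be checked against the existing entry in your own configuration; only the param line is meant to be added.

```xml
<!-- ... -->
<documenttype>
    <name>rtf</name>
    <!-- Assumption: keep the class name already present in your configuration -->
    <class>org.opencms.search.documents.CmsDocumentRtf</class>
    <!-- Added line: disables metadata extraction for this document type -->
    <param name="extract.metadata">false</param>
    <mimetypes>
        <!-- Assumption: keep the MIME types already present in your configuration -->
        <mimetype>text/rtf</mimetype>
        <mimetype>application/rtf</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>binary</resourcetype>
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
<!-- ... -->
```

Document types without the parameter keep the default behavior, so metadata extraction can be switched off selectively.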
The changed setting only affects documents that are indexed after the change. To remove already indexed metadata of an existing document from search, there are two options:
- Update the document, select "Rewrite content" and then publish it
- Rebuild the search index
Note: The context menu option "Re-index" is not sufficient, because a caching mechanism prevents the indexed content from being updated.