Option to not extract metadata from PDF files for search

June 5, 2026

The extraction of metadata from PDF, RTF, and MS Office files for search can be controlled through the global search configuration.

OpenCms extracts text from documents in various formats for search. For PDF, RTF, and MS Office documents, metadata (e.g., author, title, keywords) is also extracted by default.

Some of this metadata is not maintained deliberately but is set automatically by the software that created the document. It may also contain information that should not be searchable at all. For example, the author of a document is usually irrelevant for search.

In the global configuration file opencms-search.xml, metadata extraction can be disabled per document type by setting the parameter extract.metadata to false. Here is an example configuration for PDF documents:

<!-- ... -->
<documenttype>
    <name>pdf</name>
    <class>org.opencms.search.documents.CmsDocumentPdf</class>
    <param name="extract.metadata">false</param>
    <mimetypes>
        <mimetype>application/pdf</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>binary</resourcetype>
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
<!-- ... -->
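
The same parameter can presumably be set on the other affected document types. As a sketch only, the following shows how the entry for RTF documents might look. The class name and MIME types are assumptions based on a typical default opencms-search.xml and should be checked against the existing entry in your installation; only the extract.metadata line is the actual change.

```xml
<!-- ... -->
<documenttype>
    <name>rtf</name>
    <!-- Assumed class name; keep whatever your existing rtf entry uses -->
    <class>org.opencms.search.documents.CmsDocumentRtf</class>
    <!-- Disables metadata extraction for this document type -->
    <param name="extract.metadata">false</param>
    <mimetypes>
        <!-- Assumed MIME types; keep the ones already configured -->
        <mimetype>text/rtf</mimetype>
        <mimetype>application/rtf</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>binary</resourcetype>
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
<!-- ... -->
```

Because the parameter is set per document type, metadata extraction can stay enabled for some formats while being disabled for others.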

The changed setting only affects documents that are indexed after the change. To remove metadata that is already in the search index for an existing document, there are two options:

Update the document, select "Rewrite content" and then publish it
Rebuild the search index

Note: The context menu option "Re-index" is not sufficient, because a caching mechanism prevents the indexed content from being updated.