Insights
Introduction
The Insights accelerator is a collection of flows that can be used to analyze the data in the Content Store.
To use the Insights accelerator, you need to have Elasticsearch installed and running. For more comprehensive information on how to configure Elasticsearch, please refer to the Elasticsearch guide. For a quick installation guide, follow the Quick Installation Steps chapter below.
The first flow, Insights (1. Analysis), performs the analysis on the Content Store and enriches the data with insights. The second flow, Insights (2. Ingest), ingests the data into Elasticsearch.
One of the supported analyses is the analysis of duplicates. To calculate duplicates, the BINARY documents in the Content Store must contain a hash. If the BINARY documents do not contain a hash, the duplicates analysis will not work as expected. You can use the Calculate hash accelerator to calculate the hash of a BINARY document.
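To illustrate why the hash matters for duplicate detection: a content hash is a digest of the binary's bytes, so two binaries with identical content produce identical hashes. The sketch below uses SHA-256 as an example; the actual Calculate hash accelerator may use a different algorithm.

```python
import hashlib

def binary_hash(data: bytes) -> str:
    """Return a hex SHA-256 digest of a binary's content.

    Identical bytes yield identical hashes, which is exactly
    what a duplicates analysis groups on.
    """
    return hashlib.sha256(data).hexdigest()

# Identical content yields identical hashes; different content does not.
a = binary_hash(b"report.pdf contents")
b = binary_hash(b"report.pdf contents")
c = binary_hash(b"other contents")
print(a == b, a == c)  # True False
```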
After the data is ingested into Elasticsearch, you can use Kibana to visualize it. For more comprehensive information on how to install and run Kibana, please refer to the Kibana guide. For a quick installation guide, follow the Quick Installation Steps chapter below.
As part of this accelerator, a pre-configured set of Kibana dashboards is provided in the same folder where you found this accelerator.
These dashboards can be imported into Kibana. For more information on how to import dashboards into Kibana, please refer to the Kibana documentation on managing saved objects.
A wealth of insights can be obtained, including but not limited to:
- Insights into the total number and size of the content
- Insights into the structure of the content
- Insights into versions and translations
- Insights into duplicates
- Insights into the quality and completeness of metadata
- Insights into the content lifecycle
Quick Installation Steps
1. Download Elasticsearch and Kibana
2. Unzip the downloaded packages in the Software Repository
3. Run Elasticsearch for the first time
   - Navigate to the Elasticsearch root directory (e.g., `D:\Software\elasticsearch-8.8.1-windows-x86_64\elasticsearch-8.8.1`).
   - In the address bar, type `cmd` and press Enter.
   - In the Command Prompt, type `bin\elasticsearch` and press Enter.
4. Elasticsearch password and Kibana enrollment token
   - The Elasticsearch password will be shown in the Elasticsearch terminal under `Password for the elastic user`.
   - The Elasticsearch username is `elastic` by default.
   - The Kibana enrollment token will be shown in the terminal under `Copy the following enrollment token and paste it into Kibana...`.
   - Store both values in a safe location.
5. Run Kibana for the first time
   - Navigate to the Kibana root directory (e.g., `D:\Software\kibana-8.8.1-windows-x86_64\kibana-8.8.1`).
   - In the address bar, type `cmd` and press Enter.
   - In the Command Prompt, type `bin\kibana-setup --enrollment-token <your enrollment token>` and press Enter twice. Make sure to change the enrollment-token value accordingly.
   - When successful, in the same terminal type `bin\kibana.bat` and press Enter.
6. Validate
   - Go to the Elasticsearch homepage and enter the `username` and `password`. Done!
- Requires Elasticsearch 8.8.1 and Kibana 8.8.1.
- It is recommended to run the Elasticsearch and Kibana instances on a separate machine.
- After running Elasticsearch for the first time, activate Kibana with the enrollment token within 30 minutes.
Re-running Insights
- The first flow can be re-run to re-analyze the data in the Content Store.
- The second flow can be re-run to re-ingest the data into Elasticsearch.
- If you want to re-ingest the data into Elasticsearch, it is recommended to delete the index in Elasticsearch first.
- For more information on how to delete an index in Elasticsearch, please refer to the Elasticsearch delete index documentation, or navigate to `Stack Management > Index Management` in Kibana.
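For instance, an index can also be deleted from the Kibana Dev Tools console. The index name `insights` below is only a placeholder; use the value of your `indexName` setting:

```
DELETE /insights
```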
Metadata
This chapter describes Quality analysis, Custom metadata, and Elasticsearch data mapping.
Quality analysis
In addition to the pre-configured set of dashboards, Kibana offers a feature called Field Statistics that can be used to analyze metadata quality and completeness. For more information on how to use Field Statistics, please refer to the Kibana documentation.
Custom metadata
By default, the Insights accelerator is configured to analyze the data in the Content Store metadata. If you want to analyze custom metadata, you can modify the Insights (2. Ingest) flow to include the custom metadata by editing the two Template Engine components in the flow. Store the custom metadata in the `metadata` key as an object. For example:
```
{
  "operation": ...,
  "data": {
    ... // Content Store metadata
    "metadata": {
      "customMetadata": "customMetadataValue"
    }
  }
}
```
The Insights accelerator has the following limitations:
- Only the primary binary of a record is analyzed
- If a document has multiple parents, only the first parent is analyzed
Elasticsearch data mapping
| Name | Mandatory | Type | Description | Content Store Path |
|---|---|---|---|---|
| _id | yes | | Unique identifier | _id |
| id | yes | keyword | Id of object or full path if file system | source.versionInfo.seriesId |
| type | yes | keyword | Type of object, e.g. CONTAINER | kind |
| subType | yes | keyword | SubType to specify a more detailed type, e.g. container | source.contentType.systemName |
| isFile | yes | boolean | Specifies if an object is/contains a binary | hardcoded |
| source | yes | keyword | Name of the source system | migration.origin |
| name | yes | text | Name of the object | source.name.systemName |
| name.keyword | automatically calculated | keyword | | |
| name.length | automatically calculated | token_count | | |
| name* | no | keyword | Fields with the prefix "name" will be automatically indexed | |
| description | no | text | Description of the object, e.g. title field | source.description |
| state | yes | keyword | One or more state values, e.g. hidden | source.states |
| hierarchy | yes | text | Full and unique hierarchy to the object | source.hierarchies |
| hierarchy.keyword | automatically calculated | keyword | | |
| hierarchy.length | automatically calculated | token_count | | |
| hierarchy.depth | automatically calculated | token_count | | |
| hierarchy.tree | automatically calculated | text | Can be used for special search use cases | |
| hierarchy.treeReversed | automatically calculated | text | Can be used for special search use cases | |
| url | no | keyword | Contains the full web url in case of an ECM system | |
| parent | yes | keyword | Full parent hierarchy | Parent path of source.hierarchies |
| parent.tree | automatically calculated | text | Can be used for aggregation on the structure | |
| parentId | no | keyword | Unique id of the parent object | source.parentIds |
| dateCreated | yes | date | Creation date of the object | source.created.date |
| dateModified | yes | date | Last modified date of the object | source.lastModified.date |
| dateAccessed | no | date | Last accessed date of the object | source.lastAccessed.date |
| date* | no | date | Fields with the prefix "date" will be automatically indexed | |
| principalCreated | no | text | Principal that created the object, e.g. group | |
| principalCreated.keyword | automatically calculated | keyword | | |
| principalModified | no | text | Principal that last modified the object, e.g. user | |
| principalModified.keyword | automatically calculated | keyword | | |
| principalAccessed | no | text | Principal that last accessed the object, e.g. user | |
| principalAccessed.keyword | automatically calculated | keyword | | |
| principalAuthor | no | text | The author of the object | |
| principalAuthor.keyword | automatically calculated | keyword | | |
| principalOwner | no | text | The owner of the object | |
| principalOwner.keyword | automatically calculated | keyword | | |
| principal* | no | text | Fields with the prefix "principal" will be automatically indexed | |
| binaryExtension | yes (for files) | keyword | Extension of the binary, if empty then null | source.binaries.source.rawExtension |
| binaryExtension.normal | automatically calculated | keyword | Normalized version of the value | |
| binaryExtension.length | automatically calculated | token_count | | |
| binaryByteSize | yes (for files) | long | Size in bytes of the binary | source.binaries.source.byteSize |
| binaryHash | no | keyword | Hash of the binary | source.binaries.source.hash |
| reversedVersionOrder | yes | integer | A number specifying the order of versions in a reversed manner | reversedVersionOrder |
| versionCount | yes | integer | Number of versions for the object, including the current version | versionCount |
| language | yes | keyword | Specifies the language of an object | source.language.systemName |
| isOriginalLanguage | yes | boolean | Specifies if the object is in the original language | source.language.masterReference |
| analyticsBinaryHashCount | automatically calculated | integer | Calculated field containing the amount of times a hash exists | source.binaries.source.properties.insights.hashCount |
| analyticsBinaryUnique | automatically calculated | boolean | Calculated field set to true for only one of the objects per hash | source.binaries.source.properties.insights.binaryUnique |
| analyticsBinaryParentUnique | automatically calculated | boolean | Calculated field set to true for only one of the objects per hash sharing the same parent | source.binaries.source.properties.insights.binaryParentUnique |
| analyticsHasChildren | automatically calculated | boolean | Calculated field set to true if the object has child objects | source.properties.insights.hasChildren |
| analyticsClassification.* | | keyword | The results of the classification process based on the binaryExtension, grouped by binaryFileSize | source.binaries.source.properties.insights.analyticsClassification |
| analyticsTranslationCount | no | integer | Calculated field containing the amount of times an object is translated | |
| analyticsAvailableTranslations | no | keyword | Calculated field containing all available translations for an object | |
| purviewAppliedLabel | no | keyword | The label applied by Purview | source.properties.purview.appliedLabel |
| purviewInformationTypes | no | keyword | The information types found by Purview | source.properties.purview.informationTypes |
| migrate | yes | boolean | Whether the document will be migrated | migration.migrate |
| migrationId | yes | keyword | The Id in the target system after migration | migration.id |
| migrationFailed | yes | boolean | Indicates if the migration failed for this document | migration.failed |
| migrationFailedMessage | yes | text | Indicates the reason for a failed migration | migration.failedMessage |
| metadata | no | object | Object field to store any additional metadata | |
When storing any additional fields in the `metadata` object, the type of the first value decides the type for this field. Changes in the field type will cause validation errors. For example:
- The first document has a `metadata.registrationDate` field with the value `"20th September 2021"`. The type for this field will be `keyword` as it is a `string`.
- Later on we have a document with the field `metadata.registrationDate`, but in this case the value is `"2021-09-20T21:50:28.342Z"`. The type for this field will be `date`. This will cause a validation error.
The documents are by default ingested in batches of 1000. If one of the documents fails, due to a validation error as mentioned above, the whole batch won't be imported.
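The failure mode above can be sketched with a toy type-inference function. This is a simplified stand-in for Elasticsearch's dynamic mapping, which fixes a field's type from the first value it sees; `infer_type`, `validate_batch`, and the ISO-only date pattern are illustrative, not the real mapping logic.

```python
import re

# Simplified stand-in for Elasticsearch dynamic date detection (ISO 8601 only).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")

def infer_type(value) -> str:
    """Map a JSON value to a (simplified) Elasticsearch field type."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, str) and ISO_DATE.match(value):
        return "date"
    return "keyword"

def validate_batch(docs, field) -> bool:
    """The first document fixes the field type; a later mismatch fails the batch."""
    expected = None
    for doc in docs:
        t = infer_type(doc[field])
        if expected is None:
            expected = t
        elif t != expected:
            return False  # validation error: the whole batch is rejected
    return True

docs = [
    {"registrationDate": "20th September 2021"},       # inferred as keyword
    {"registrationDate": "2021-09-20T21:50:28.342Z"},  # inferred as date -> mismatch
]
print(validate_batch(docs, "registrationDate"))  # False
```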
For more information on the different field data types in Elasticsearch, please refer to the Elasticsearch field data types documentation.
Flows
This chapter describes the configuration of the flows Insights (1. Analysis) and Insights (2. Ingest).
Insights (1. Analysis) settings
Performs the analysis on the Content Store and enriches the data with insights.
mongoConnection
The Mongo connection string including the database name to connect to.
Insights (2. Ingest) settings
Ingests the data into Elasticsearch.
You can easily change the data that is ingested by modifying the Document Retrieve component in the flow.
The default query is:
```json
{
  "kind": {
    "$in": ["CONTAINER", "RECORD"]
  },
  "source.versionInfo.isCurrent": true
}
```
You can extend this query to include additional filters. For example:
```json
{
  "kind": {
    "$in": ["CONTAINER", "RECORD"]
  },
  "source.versionInfo.isCurrent": true,
  "source.contentType.systemName": "myContentType"
}
```
The `isCurrent` filter is required because the checkbox Include source versions is checked.
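To sanity-check which documents an extended filter would select before running the flow, the same criteria can be evaluated against sample documents. The sketch below implements only the flat equality and `$in` semantics used by these queries; the sample document is hypothetical, not the real Content Store schema.

```python
def matches(doc: dict, query: dict) -> bool:
    """Evaluate a flat Mongo-style query (dotted paths, equality, $in) against a document."""
    for path, condition in query.items():
        # Walk the dotted path into the nested document.
        value = doc
        for key in path.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        if isinstance(condition, dict) and "$in" in condition:
            if value not in condition["$in"]:
                return False
        elif value != condition:
            return False
    return True

query = {
    "kind": {"$in": ["CONTAINER", "RECORD"]},
    "source.versionInfo.isCurrent": True,
    "source.contentType.systemName": "myContentType",
}
doc = {
    "kind": "RECORD",
    "source": {
        "versionInfo": {"isCurrent": True},
        "contentType": {"systemName": "myContentType"},
    },
}
print(matches(doc, query))  # True
```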
elasticsearchConnection
Elasticsearch connection string
Example: http://localhost:9200
elasticsearchUsername
Elasticsearch username
elasticsearchPassword
Elasticsearch password
elasticsearchCertificatePath
Path to the Elasticsearch certificate
Example: C:\certificates\elasticsearch\elasticsearch.crt
mongoConnection
The Mongo connection string including the database name to connect to.
indexName
The name of the index to use in Elasticsearch