
For my project I have to ingest and curate data from various sources.
I do that through the Data Hub Framework using flows.
All my sources have a field called "date", but it arrives in different formats, e.g. yyyy-mm-dd, yyyymmdd, dd.mm.yyyy.

I do the curation through a mapping step to one common format yyyy-mm-dd.
After the mapping the field is still called "date".
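
To make the intended normalization concrete, here is a minimal sketch of the logic in Python. It is only an illustration: in Data Hub the conversion would live in the mapping step's own expression for the "date" property, not in Python, and the names `KNOWN_FORMATS` and `normalize_date` are hypothetical.

```python
from datetime import datetime

# The three input formats listed above: yyyy-mm-dd, yyyymmdd, dd.mm.yyyy.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y%m%d", "%d.%m.%Y")


def normalize_date(raw: str) -> str:
    """Return the date as yyyy-mm-dd, or raise ValueError if no format matches."""
    value = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Anything that matches none of the known formats raises an error instead of being passed through, which mirrors rejecting invalid values in the FINAL database.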

Since I want to be able to do range searches, I need a range index on my "date". However, this leads to an error when ingesting the data, because the ingested data’s "date" field has not yet been mapped to the right format.

My solution was to not reject invalid values in the STAGING database. However, the original document is attached in the envelope of the curated document that is written to the FINAL database after the mapping, so I get a range index error for the attached document.

I want to reject invalid values in the FINAL database but I also want to keep the original document as an attachment in the final file.

The only solution I can see so far is to name the "date" element in the FINAL database something like iDate in order to avoid conflicts.

This doesn’t seem like a clean solution to me. Do you have better suggestions?

I am using:

  • CentOS 7
  • MarkLogic 10
  • MarkLogic Data Hub 5.2.3

3 Answers


  1. If you use a path range index, you can limit it to just those date elements that are in the top-level instance and not in the attachment.

    See https://docs.marklogic.com/guide/admin/range_index#id_40666 for details on using path range indexes.

  2. Configure a path field to target only the normalized dates, excluding those original date elements that are in the envelope.

    https://docs.marklogic.com/guide/admin/fields#id_23934

    In a path field, the included and excluded elements are constrained to the sub-tree identified by the path. For example, if the path for the field is /A/B/C, only elements in node C, such as A/B/C/D, A/B/C/D/E and /A/B/C/Z, are included or excluded from the field.

    A path field may include one or more paths. Multiple paths are treated as the union of the paths. Consequently, each of them will identify a root of a field-instance in a given document.

  3. Giving your standardized "date" element a different name seems like a perfectly clean approach. Ideally, all elements of the same name (qualified by namespace) have the same definition and the same semantics. When your sources give you "dates" that are not actually xs:dates, then the "date" node being provided is really just a (slightly constrained) string. Sure, it describes a date in some easily-translated way, but it’s not actually a date field, and so it’s not indexable as a date field. When you standardize your source data, you’re creating a new data point, one which actually is a date, and it ‘should’ take a new identity: either a new local name as you suggest, or "date" in your "normalized data" namespace.

    <envelope xmlns:my="http://example.org/normalized"> <!-- placeholder namespace URI -->
      <original>
        <date>09-09-2020</date>
        <!-- etc -->
      </original>
      <normalized>
        <my:date>2020-09-09</my:date>
      </normalized>
    </envelope>
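
The path-based suggestions in answers 1 and 2 rest on one idea: select "date" by where it sits in the envelope, not by its name alone. The following Python sketch illustrates that idea outside MarkLogic, using the standard library's XML parser. The envelope layout (`instance`, `attachments`) is a simplified assumption without namespaces, and `is_iso_date` is a hypothetical helper, so check the actual paths against your own curated documents.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Simplified, assumed envelope: the curated date sits under <instance>,
# the untouched source document under <attachments>.
DOC = """
<envelope>
  <instance>
    <date>2020-09-09</date>
  </instance>
  <attachments>
    <date>09.09.2020</date>
  </attachments>
</envelope>
""".strip()


def is_iso_date(text: str) -> bool:
    """True if text parses as an ISO date (yyyy-mm-dd)."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


root = ET.fromstring(DOC)

# Selecting by name alone picks up both the curated and the original value.
by_name = [e.text for e in root.findall(".//date")]

# Selecting by path picks up only the curated value under <instance>.
by_path = [e.text for e in root.findall("./instance/date")]

print(by_name, [is_iso_date(v) for v in by_name])
print(by_path, [is_iso_date(v) for v in by_path])
```

The name-based selection includes the attached original, which is not a valid date. The path-based selection sees only the normalized value, which is why a path-scoped index can reject invalid values without tripping over the attachment.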
    