skip to Main Content

I have a MongoDB collection with around 20K articles with an "articledate" string field with dates in the "dd/mm/YYYY" format.

Want to query the collection and get the sum of the count of articles in each mm/YYYY.

This is the pipeline I have written which should also handle the date format:

# Aggregation pipeline to group documents by year and month and count them
pipeline = [
    {"$match": {"articledatepub": {"$exists": True}}},
    {
        "$addFields": {
            "date_object": {
                "$dateFromString": {"dateString": "$articledatepub", "format": "%m/%d/%Y"}
            }
        }
    },
    {
        "$group": {
            "_id": {
                "year": {"$year": "$date_object"},
                "month": {"$month": "$date_object"},
            },
            "count": {"$sum": 1},
        }
    },
    {"$sort": {"_id.year": 1, "_id.month": 1}},
]

but when I execute it the result is empty:

# Execute the aggregation pipeline
result = list(articles.aggregate(pipeline))

Here is a small fictional sample of the data:

{
  "_id": {
    "$oid": "641dddcd685f9e1cbbc73490"
  },
  "articleid": 2003287422,
  "articledatepub": "23/03/2023"
}
{
  "_id": {
    "$oid": "641dddcd685f9e1cccc73490"
  },
  "articleid": 2003283422,
  "articledatepub": "22/03/2023"
}
{
  "_id": {
    "$oid": "641dddcd687f9e1cccc73490"
  },
  "articleid": 2003383422,
  "articledatepub": "11/02/2023"
}

from which I’d expect a count of two articles for March 2023 and one for February 2023.

2

Answers


  1. Chosen as BEST ANSWER

    The code above would not work because of the DD/MM/YYYY format.

    Here is the working code which parses that date format "manually" with a revised pipeline:

    pipeline = [
        {"$match": {"articledatepub": {"$exists": True}}},
        {
            "$addFields": {
                "date_object": {
                    "$dateFromString": {
                        "dateString": {
                            "$concat": [
                                {"$substr": ["$articledatepub", 6, 4]},  # Year
                                "-",
                                {"$substr": ["$articledatepub", 0, 2]},  # Month
                                "-",
                                {"$substr": ["$articledatepub", 3, 2]},  # Day
                            ]
                        },
                    }
                }
            }
        },
        {
            "$group": {
                "_id": {
                    "year": {"$year": "$date_object"},
                    "month": {"$month": "$date_object"},
                },
                "count": {"$sum": 1},
            }
        },
        {"$sort": {"_id.year": 1, "_id.month": 1}},
    ]
    

  2. You just need to change the format parameter of $dateFromString to "%d/%m/%Y". Try this:

    pipeline = [
        {"$match": {"articledatepub": {"$exists": True}}},
        {"$project": {"date_object": {
            "$dateFromString": {"dateString": "$articledatepub", "format": "%d/%m/%Y"}}}},
        {"$project": {"year": {"$year": "$date_object"}, "month": {"$month": "$date_object"}}},
        {"$group": {"_id": {"year": "$year", "month": "$month"}, "count": {"$sum": 1}}},
        {"$sort": {"_id.year": 1, "_id.month": 1}},
    ]
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search