
I am working on a social networking application with at least 0.2M users. In the application, a user can share content from third parties and can also upload their own media as posts. There are different types of privacy:

  • user privacy
    • a user can be public
    • a user can be private

Any content shared or uploaded by the user goes into a box, and a box also has different types of privacy:

  • public box (everyone can see the content of this box if you are public)
  • friends-only box (only your followers can see the content of this box)
  • private box (only you can see the content of this box)
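The rules above can be read as a small decision function over two settings, the account privacy and the box privacy. The following is a minimal sketch of that reading, not the application's actual code. The names `UserPrivacy`, `BoxPrivacy`, and `can_view` are hypothetical, and the handling of a public box on a private account (visible to followers only) is an assumption, since the description only says what happens when the account is public.

```python
from enum import Enum


class UserPrivacy(Enum):
    PUBLIC = "public"
    PRIVATE = "private"


class BoxPrivacy(Enum):
    PUBLIC = "public"
    FRIENDS = "friends"
    PRIVATE = "private"


def can_view(user_privacy, box_privacy, is_owner, is_follower):
    """Return True if the viewer may see posts in the box."""
    # The owner always sees their own content.
    if is_owner:
        return True
    # Private box: only the owner.
    if box_privacy is BoxPrivacy.PRIVATE:
        return False
    # Friends-only box: followers only.
    if box_privacy is BoxPrivacy.FRIENDS:
        return is_follower
    # Public box on a public account: everyone.
    if user_privacy is UserPrivacy.PUBLIC:
        return True
    # ASSUMPTION: public box on a private account is limited to followers.
    return is_follower
```

Written this way, visibility depends only on two small fields plus the viewer's relationship to the owner, which is why changing either setting does not in principle require touching every post.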

Now the problem is that I have a large data set. When a user changes their account privacy from public to private or vice versa, I have to update all of their data according to the new privacy. A user can also change the privacy of a box, so I need to update all of the user's shared posts in that box accordingly. Most of the time the update fails because of the framework and the technologies that I am using.

Technologies that I am using:

  • Lumen (PHP) microservice architecture
  • MySQL
  • Elasticsearch (For retrieval with joins)
  • Redis & Memcached
  • Postgres

When a user shares anything on the platform, the shared data is stored in the database and also inserted into Elasticsearch, so all data is retrieved from Elasticsearch with the PHP client.

Now I want to define an architecture like Instagram's, so that whenever a user changes account privacy or box privacy, the visibility of the content changes according to both privacy settings.

I have read various articles but didn't find anything close to this. Kindly suggest any helpful article or idea.

2 Answers


  1. I would agree with @KolovosKonstantinos: you should try to model your application data to avoid updates of large data sets.
    It might also be worth checking out the concept of Embedded vs Referenced Documents. Here are a couple of nice posts on this topic:

    I would suggest trying the following approach:

    • The User entity has a privacy property and a separate collection of post ids for every type of box
    • Every post is stored as a separate document in Elasticsearch
    • You have different queries to select post ids from the different collections based on the user's privacy settings. When the privacy of a post changes, simply move its id from one collection to another. Yes, you need a second query (sometimes called a round trip) to storage to retrieve a post once you know its id. That's the tradeoff you make; defining an architecture is all about tradeoffs.

    So your data might look like the following:

    User document:
    {
        "userId": "1",
        "privacy": "Public",
        "publicBoxPostIds":  [1,3],
        "friendsBoxPostIds": [2],
        "privateBoxPostIds": []
    }
    
    Post documents:
    {
        "postId": "1",
        "postText": "zbzb",
         ...
    },
    {
        "postId": "2",
        "postText": "xcxc",
         ...
    }
    ...
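To make the "move the id from one collection to another" step concrete, here is a minimal in-memory sketch that works on a user document shaped like the one above. The helper names `move_post` and `visible_post_ids` are hypothetical, and the rule that a private account's public box is shown to followers only is an assumption, not something the approach itself specifies. In a real system the user document would be read from and written back to storage rather than mutated in place.

```python
# Maps a box type to the field holding its post ids in the user document.
BOX_FIELDS = {
    "public": "publicBoxPostIds",
    "friends": "friendsBoxPostIds",
    "private": "privateBoxPostIds",
}


def move_post(user_doc, post_id, target_box):
    """Move a post id into the target box, removing it from any other box."""
    for field in BOX_FIELDS.values():
        if post_id in user_doc[field]:
            user_doc[field].remove(post_id)
    user_doc[BOX_FIELDS[target_box]].append(post_id)


def visible_post_ids(user_doc, is_owner, is_follower):
    """Return the post ids a viewer may see; posts are fetched by id afterwards."""
    if is_owner:
        return [pid for field in BOX_FIELDS.values() for pid in user_doc[field]]
    ids = []
    # ASSUMPTION: a private account exposes its public box to followers only.
    if user_doc["privacy"] == "Public" or is_follower:
        ids += user_doc["publicBoxPostIds"]
    if is_follower:
        ids += user_doc["friendsBoxPostIds"]
    return ids
```

With this layout, switching the account from public to private is a single field change on the user document, and no post document is rewritten.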
    
  2. Handling a large dataset with complex privacy settings can indeed be a challenging task. It’s great that you’re considering an architecture similar to Instagram’s, as they have likely faced similar challenges at scale. Here are some suggestions and ideas that might help you:

    1. Use Asynchronous Processing:

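      • When a user updates their privacy settings, consider using a queue-based system to process the updates asynchronously. This can be achieved with a message broker such as RabbitMQ or with a Redis-backed job queue (for example, Laravel's queue system, which Laravel Horizon can monitor).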
      • Create background jobs for updating the necessary records in your database and Elasticsearch.
    2. Batch Processing:

      • Instead of updating each record individually, consider batching updates. This can help reduce the load on your system.
      • For example, you can schedule a job that runs periodically and updates a batch of records that need to be modified based on privacy settings.
    3. Event-Driven Architecture:

      • Implement an event-driven architecture where events are triggered when a user updates their privacy settings.
      • Use Laravel’s event broadcasting or a message queue system to notify relevant microservices about the changes.
    4. Denormalization:

      • Consider denormalizing your data to reduce the number of joins required when retrieving information. This can improve the performance of your system.
      • For example, store a copy of relevant user and box privacy settings directly in the post document in Elasticsearch.
    5. Caching:

      • Leverage Redis and Memcached for caching frequently accessed data. This can significantly reduce the load on your database and improve response times.
    6. Elasticsearch Indexing Strategy:

      • Optimize your Elasticsearch indexing strategy to ensure that updates are processed efficiently. Consider using the "update" API instead of reindexing entire documents.
    7. Data Sharding:

      • If your dataset continues to grow, consider implementing data sharding to distribute the load across multiple database instances.
    8. Monitoring and Optimization:

      • Regularly monitor the performance of your system using tools like New Relic or Laravel Telescope. Identify bottlenecks and areas for optimization.
    9. Documentation and Testing:

      • Document your architecture and update processes comprehensively. Ensure that your update processes are thoroughly tested to handle edge cases and unexpected scenarios.
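As a rough illustration of how points 1, 2, 4 and 6 could fit together, the sketch below queues a privacy change as a job, applies it to denormalized post records in batches, and builds a request body for Elasticsearch's `_update_by_query` endpoint. It is a sketch under stated assumptions, not a definitive implementation: the field names `boxId` and `boxPrivacy`, the job shape, and the in-memory `posts` dictionary are all hypothetical stand-ins for the real database and index, and a plain `deque` stands in for a real message queue.

```python
from collections import deque
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def build_update_by_query(box_id, new_privacy):
    """Body for POST /<index>/_update_by_query (field names are assumptions)."""
    return {
        "query": {"term": {"boxId": box_id}},
        "script": {
            "lang": "painless",
            "source": "ctx._source.boxPrivacy = params.privacy",
            "params": {"privacy": new_privacy},
        },
    }


def process_jobs(jobs, posts, batch_size=500):
    """Drain the job queue and update matching posts batch by batch.

    Returns the number of batches written.
    """
    batches = 0
    while jobs:
        job = jobs.popleft()
        affected = [pid for pid, post in posts.items()
                    if post["boxId"] == job["boxId"]]
        for batch in chunked(affected, batch_size):
            # In a real system each batch would be one bulk write.
            for pid in batch:
                posts[pid]["boxPrivacy"] = job["privacy"]
            batches += 1
    return batches
```

The request handler would only enqueue the job and return, so the user is not kept waiting while a worker applies the change. Until the worker finishes, reads can still see the old privacy value, which matters for privacy-sensitive data and is worth accounting for in the design.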

    I don’t have a specific article to point you to, but exploring articles on large-scale system design, microservices, and asynchronous processing could provide valuable insights. Additionally, Instagram’s engineering blog might have articles discussing some of the challenges they faced and how they addressed them.

    Remember to always test your updates in a staging environment before applying them to your production environment, especially when dealing with critical privacy-related data.
