
Some colleagues and I have started working on an iPhone application that provides a social buying experience. The goal is to give users extended search capabilities (full-text, fuzzy, filter-based, etc.) over millions of products that are constantly fetched from several product-listing APIs (such as eBay and Amazon), then normalized (i.e. fields, categories and relations are transformed), and then run through business logic so that users get customized content based on several criteria (their unique profile, e.g. age/gender, search history, what their friends bought, etc.).
The application also has social features such as posts, likes and reviews about the products, following other users, etc.

So now we are trying to design a server architecture that will support these needs. Among other things, there are performance considerations (“GIVE ME all the products THAT match my search word AND ORDER them by relevancy” should run fairly fast, in roughly 1 to 10 seconds) and scalability considerations (10 concurrent users should get a result in the same amount of time as 100,000 users, provided we can throw more machines at the problem).

We assume we will have tens to hundreds of millions of products.

What we had in mind is (based on AWS):

  1. Set up Elastic Beanstalk to support scalability by adding EC2 instances whenever traffic grows and taking them down when it diminishes
  2. Set up RDS with MySQL as the RDBMS for the application (manage users, profiles, normalized products, etc.) across several availability zones
  3. Set up a background “agent” process on a different server to constantly fetch product data from the APIs (with a customizable fetching queue)
  4. Store the above “raw data” in some NoSQL store as temporary data
  5. Set up another “agent” that normalizes the data, profiles it and inserts it into the RDBMS in a way that enables very quick searches that are already based on the users’ distinct profiles
  6. Set up a caching mechanism to reduce the load on the RDBMS
  7. Set up a good full-text search engine (e.g. Lucene)
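
To make step 5 more concrete, here is a minimal Python sketch of what the normalization agent might do with one raw record. The raw field names (`itemId`, `ASIN`, `categoryName`, and so on), the `CATEGORY_MAP` table and the `Product` shape are all invented for illustration; they are not the real eBay or Amazon response formats.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP


@dataclass
class Product:
    """Common shape every source is normalized into (hypothetical schema)."""
    source: str
    source_id: str
    title: str
    price_cents: int
    currency: str
    category: str


# Hypothetical mapping from source-specific categories to one shared taxonomy.
CATEGORY_MAP = {
    ("ebay", "Cell Phones & Smartphones"): "phones",
    ("amazon", "Electronics > Mobile Phones"): "phones",
}


def to_cents(amount) -> int:
    """Convert a price such as '19.99' to integer cents without float error."""
    cents = Decimal(str(amount)) * 100
    return int(cents.to_integral_value(rounding=ROUND_HALF_UP))


def normalize(source: str, raw: dict) -> Product:
    """Map one raw record (assumed field names) onto the common Product shape."""
    if source == "ebay":
        source_id = str(raw["itemId"])
        title = raw["title"]
        price, currency = raw["price"]["value"], raw["price"]["currency"]
        raw_category = raw["categoryName"]
    elif source == "amazon":
        source_id = str(raw["ASIN"])
        title = raw["name"]
        price, currency = raw["listPrice"], raw["currencyCode"]
        raw_category = raw["browseNode"]
    else:
        raise ValueError(f"unknown source: {source}")

    category = CATEGORY_MAP.get((source, raw_category), "uncategorized")
    return Product(source, source_id, title.strip(), to_cents(price), currency, category)
```

Keeping the per-source mapping in one function like this means a new listing API only adds a branch (or a small adapter), while everything downstream sees a single schema.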

Our main considerations are:

  1. Linux environment
  2. Mainly PHP and MySQL
  3. Performance is an issue
  4. Scalability will become an issue in the near future (6-12 months) (hopefully :))

Now several questions:

  1. Does the architecture make sense?
  2. Regarding data storage: is an RDBMS the right choice, or should we consider a NoSQL engine (e.g. MongoDB)?
  3. What techniques/approaches should we consider when tackling this problem?

By the way, war stories would be much appreciated 🙂

3 Answers


  1. Two comments; this is not meant as a full answer so far.

    RDBMS vs NoSQL

    • NoSQL seems a better option to me, as you do not need strict control over data completeness all the time.

    • You also do not care whether Product X changed its place in the ranking within the last 5-10 minutes, or whether the user preferences used for search changed slightly.

    • And you’ll have a NoSQL database anyway.

    This is why an RDBMS seems a bit too much.

    Performance.

    • You’ll probably need several data preparation servers to distribute the workload.

    • You can group users and partition them across different servers by their usage patterns or preferences. You can plan for this in advance.

    • Design an ideal model of serving user requests. Once you know how many queries you can serve per instance/machine/CPU, think through how it would work. You can amend the model later and compare your expectations with real user behaviour.
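
As a rough illustration of the last point, the "ideal model" can start as simple arithmetic: given an expected peak query rate and a measured per-instance rate, estimate how many instances are needed. The function name, the headroom factor and the numbers in the usage note are assumptions for the sketch, not measurements.

```python
import math


def instances_needed(peak_qps: float, qps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Estimate the instance count for a given peak load.

    headroom is the fraction of each instance's capacity kept in reserve
    (an assumed safety margin, tune it to your own measurements).
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = qps_per_instance * (1 - headroom)
    return max(1, math.ceil(peak_qps / usable_qps))
```

For example, `instances_needed(2000, 150)` gives 20 under these assumptions. Re-running the same calculation with measured numbers later shows how far the anticipated model is from real behaviour.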

    1. Yes
    2. It depends on how structured you want your data to be at the storage level. If you build that structure up in memory, or use Lucene for search, look at NoSQL options (DynamoDB on AWS).
    3. Look at using a Hadoop cluster for normalising your data in a timely fashion.
  2. I think that, for what you’ve described, you will probably want to avoid Elastic Beanstalk and deploy directly onto EC2 instances that you control.

    The front end will run the web load and mostly query from cache. This can sit behind an Elastic Load Balancer, and you can use autoscaling rules to ensure you always have enough resources to handle the load.
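
A minimal sketch of the "mostly query from cache" idea, using the cache-aside pattern: look in the cache first and only call the slower backend on a miss or after expiry. The in-process `TTLCache` class here is a stand-in for illustration; in a real deployment you would more likely use a shared cache such as memcached or Redis, and the TTL value is an assumption.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-process cache-aside helper (illustrative stand-in for a shared cache)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value, or call loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: the backend is not touched
        value = loader(key)  # miss or stale: go to the search engine / database
        self._data[key] = (now + self._ttl, value)
        return value
```

A front-end handler would then call something like `cache.get_or_load(search_term, run_search)`, where `run_search` is whatever function actually queries the backend.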

    I would probably look at Solr for full-text search, but I’m not an expert in this. I think Solr has some of the scalability, replication, etc. features that make managing your search infrastructure a bit easier. There are some good AWS Solr reference architectures that are designed to scale.

    It sounds like you will need a couple of back end service layers – one to pull in the data, another to normalize it. If you are going to commit to AWS, you can probably build these so that a central control process doles out work to instances you get via the spot market – that can help to reduce the overall costs. If the spot market spikes, you can choose to either slow down importing/processing, or use on-demand instances and increase the costs a bit.
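
The spot-market trade-off described above can be sketched as a small decision function inside the control process, next to a queue that hands out batches of work. The thresholds, names and return values are hypothetical; how you actually read the current spot price and launch instances depends on the AWS tooling you use and is not shown.

```python
from collections import deque
from typing import Deque, List


def choose_capacity(spot_price: float, max_spot_price: float,
                    backlog: int, max_backlog: int) -> str:
    """Decide where the next batch of import/normalization work should run."""
    if spot_price <= max_spot_price:
        return "spot"        # spot is cheap enough: keep using it
    if backlog > max_backlog:
        return "on_demand"   # too far behind: pay more to catch up
    return "slow_down"       # spot spiked but we can afford to wait


class WorkQueue:
    """Central queue from which the control process doles out batches."""

    def __init__(self) -> None:
        self._tasks: Deque[str] = deque()

    def add(self, task: str) -> None:
        self._tasks.append(task)

    def next_batch(self, size: int) -> List[str]:
        """Pop up to `size` tasks in FIFO order."""
        batch = []
        while self._tasks and len(batch) < size:
            batch.append(self._tasks.popleft())
        return batch

    def __len__(self) -> int:
        return len(self._tasks)
```

The control loop would call `choose_capacity(...)` with the current price and `len(queue)` as the backlog, then hand `queue.next_batch(n)` to whichever kind of worker it decided on.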

    I’d probably design this to use a combination of MySQL and a NoSQL store: MySQL for core functionality (accounts, user preferences, etc.), and NoSQL for the product information. You probably want to store that in a format that can be used directly by the UI with minimal processing. Properly designed, this should allow sharding of the NoSQL store, which will help scalability, although you’ll need a way to reproduce data if a node goes down.
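
To illustrate the sharding idea, here is a minimal sketch of picking a shard from a product key with a stable hash. The key format (`"ebay:123"`) and the shard count are assumptions; many NoSQL stores do this routing for you, so treat it as an explanation of the principle rather than something you necessarily have to write.

```python
import hashlib


def shard_for(product_key: str, num_shards: int) -> int:
    """Map a product key to a shard index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a fixed digest is used to get the same answer on every machine.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(product_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Note that plain modulo sharding moves most keys when `num_shards` changes; if you expect to add nodes often, consistent hashing (or a store that handles rebalancing itself) is worth looking at.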

    To handle the relationship between products and related data (comments, posts, etc.), you will need to associate them with whatever key is used to retrieve the products from the NoSQL store. If you are going to be dealing with millions and millions of product records, you will probably want to determine your data retention requirements: do you really need to keep details of a product that has been obsolete and/or unavailable for years?

    If search is going to be the primary interface to the data, however, you may not need a NoSQL solution; simply pull what you need back from Solr.

    You can put caching in front of most of these layers.
