I have a function in my JavaScript code that loops through an array and performs some time-consuming actions on each item. It works fine for now while the number of items in the array is low, but I also want the code to work when the array is larger. Here is my function:

const fetchAndProcessNews = async (queryString, from) => {
  const query = {
    queryString,
    from,
    size: 1,
  }
  try {
    console.log('Fetching news...')
    const { articles } = await searchApi.getNews(query)
    console.log('total articles fetched:', articles.length)
    console.log('Fetched news:', articles)
    if (articles && articles.length > 0) {
      console.log('Processing news...')
      //looping through all the articles fetched from api
      for (const article of articles) {
        console.log('Processing article with name: ', article.title)
        const { title, sourceUrl, id, publishedAt } = article
        //scraping content from the source url and returning the markup of the single article
        const markup = await scraper(sourceUrl)
        //using gpt to perform some tasks on the markup returned from scraping
        const data = await askGpt(markup)
        //using dall e to generate an image
        const generatedImageUrl = await generateImg(data?.imageDescription)
        //downloading the image from the url and uploading it to s3
        const s3ImageUrl = await generateImgUrl(generatedImageUrl, title, id)
        //uploading the article to strapi using post request
        const newTitle = data?.title
        const newMarkup = data?.content
        const description = data?.abstract
        const categories = data?.categories

        console.log('pushing article to strapi')
        await createPost(
          newTitle,
          description,
          newMarkup,
          s3ImageUrl,
          publishedAt,
          categories
        )
        console.log('article processing completed...')
      }
    } else {
      console.log('No articles found')
    }
  } catch (error) {
    console.error('Error fetching news:', error.message)
  }
}

Let me explain what I am doing: I am fetching some news articles from an API, and for every article I perform these tasks:

  1. scrape the content from a URL provided by the API using Cheerio, which takes some time
  2. use OpenAI to perform some tasks on the markup, which also takes a lot of time
  3. generate an image using DALL·E, which also takes time
  4. upload the image to S3
  5. upload everything to Strapi using a POST request

Now I am worried: if the number of articles is, let's say, 100 or 1000, how will this code behave? Will it be able to handle all of these time-consuming tasks? How can I make it more robust so that it does not crash and keeps working normally? I don't have that much experience, which is why I am a little worried. What techniques should I use? Should I use some kind of queue like Bull, or batch processing? A detailed answer would be a big help.

2 Answers


  1. The idea is to not wait for all steps of one article's processing to complete before starting to process the next article.

    You could, for instance, start the next article every 100 milliseconds. You could even start them all without delay, but then you risk making too many requests to the same server and hitting some server limit. So it is probably more prudent to have a slight delay between the initiations of article processing. For such an intermediate delay, you could use this general-purpose function:

    const delay = ms => new Promise(resolve => setTimeout(resolve, ms));
    

    The easiest way to accomplish the overall idea, is to put the code for article processing in a separate function:

    const processArticle = async (article) => {
        console.log('Processing article with name: ', article.title)
        const { title, sourceUrl, id, publishedAt } = article
        //scraping content from the source url and returning the markup of the single article
        const markup = await scraper(sourceUrl)
        //using gpt to perform some tasks on the markup returned from scraping
        const data = await askGpt(markup)
        //using dall e to generate an image
        const generatedImageUrl = await generateImg(data?.imageDescription)
        //downloading the image from the url and uploading it to s3
        const s3ImageUrl = await generateImgUrl(generatedImageUrl, title, id)
        //uploading the article to strapi using post request
        const newTitle = data?.title
        const newMarkup = data?.content
        const description = data?.abstract
        const categories = data?.categories
    
        console.log('pushing article to strapi')
        await createPost(
          newTitle,
          description,
          newMarkup,
          s3ImageUrl,
          publishedAt,
          categories
        )
        console.log('article processing completed...')
    };
    

    No code was changed here; it was just moved into a function.

    Now your main function can execute the above function without awaiting it. Instead, it can capture the promise it returns (which will be pending) and collect these promises in an array. This way, several requests will be in flight for different articles without awaiting their responses. Finally, you will probably want to await until all of these promises have settled.

    Here is how your original function would look:

    const fetchAndProcessNews = async (queryString, from) => {
      const query = {
        queryString,
        from,
        size: 1,
      }
      try {
        console.log('Fetching news...')
        const { articles } = await searchApi.getNews(query)
        console.log('total articles fetched:', articles.length)
        console.log('Fetched news:', articles)
        if (articles && articles.length > 0) {
          console.log('Processing news...')
          // looping through all the articles fetched from api
          const promises = [];
          for (const article of articles) {
            promises.push(processArticle(article)); // We don't await!
            await delay(100); // Determine which delay is suitable
          }
          // All articles are now being processed; wait for all to finish (optional)
          await Promise.allSettled(promises);
        } else {
          console.log('No articles found')
        }
      } catch (error) {
        console.error('Error fetching news:', error.message)
      }
    }
    

    The await Promise.allSettled operation is optional, but it will be useful for the caller of fetchAndProcessNews, as the promise they receive will then only resolve once everything is done.
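
    For instance, a minimal sketch (assuming you also want a summary of failures inside fetchAndProcessNews; the exact log wording is illustrative) could capture the settled results and report any rejections:

    const results = await Promise.allSettled(promises);
    const failed = results.filter(r => r.status === 'rejected');
    // each rejected entry carries its error in the `reason` property
    failed.forEach(r => console.error('Article failed:', r.reason.message));
    console.log(`${results.length - failed.length} of ${results.length} articles succeeded`);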

    On a final note, you will probably want to improve the console.log output made in processArticle, as the messages will now get interwoven, and it will be useful to see which article the messages "pushing article to strapi" and "article processing completed..." refer to.
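
    For example, a small helper (the name log is just illustrative) that prefixes every message with the article title keeps the interleaved output readable:

    const log = (article, message) =>
      console.log(`[${article.title}] ${message}`);

    // e.g. inside processArticle:
    log(article, 'pushing article to strapi')
    log(article, 'article processing completed...')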

  2. One might try both of the following:

    • separating the asynchronous/deferred calculation of the post-data for each article to be scraped from the upload of that calculated post-data,

    • and parallelizing the post-data calculations and the post API calls alike.

    A possible solution first implements a function that calculates the post-data of any single article to be scraped.

    A convenient approach for parallelizing the post-data calculations, while trying not to overload or block the API entirely, is the combined usage of an async generator function, the returned async generator, and Promise.all.

    One creates batches of parallel, deferred post-data calculations by splicing N article items at a time from the array of articles to be scraped (the one initially derived from the first API call).

    As long as there are article items left within the constantly mutated article array, the async generator yields a promise over all N asynchronous/deferred post-data calculations, where N is the batch size.

    Iterating the async generator with the for await...of statement gives access to an array of resolved post-data. Thus, one can now try to parallelize the post API calls as well, by creating a promise over all of those calls, each one passed its related args array of resolved post-data.

    async function getDeferredArticlePostData(article) {
      const { title, sourceUrl, id, publishedAt } = article;
    
      console.log(`Get post-data of article "${ title }".`);
    
      // nothing here which could be further parallelized.
      const markup = await scraper(sourceUrl);
      const data = await askGpt(markup);
    
      const generatedImageUrl = await generateImg(data?.imageDescription);
      const s3ImageUrl = await generateImgUrl(generatedImageUrl, title, id);
    
      // the post-data derived from a scraped article.
      return [
        data?.title, data?.abstract, data?.content,
        s3ImageUrl, publishedAt, data?.categories,
      ];
    }
    
    async function* createDeferredBatchesOfParallelRetrievedArticlePostData(
      articleList = [], batchSize = 4, // play with the batch size.
    ) {
      while (articleList.length >= 1) {
    
        // - try to parallelize the calculation of article post-data by creating
        //   `Promise.all` based batches of deferred post-data calculations.
    
        yield Promise.all(
          articleList.splice(0, batchSize).map(getDeferredArticlePostData)
        );
      }
    }
    
    async function fetchAndProcessNews(queryString, from) {
      const query = { queryString, from, size: 1 };
    
      try {
        const { articles } = await searchApi.getNews(query);
    
        if (articles && articles.length > 0) {
    
          const deferredPostDataPool =
            createDeferredBatchesOfParallelRetrievedArticlePostData([...articles]);
    
          // ... separate the (parallelized and batched) post-data calculation ...
    
          for await (const listOfResolvedPostData of deferredPostDataPool) {
    
            // ... from posting an article's new data;
            //     but here too, trying to parallelize the post api calls.
    
        // - one could even try to intentionally omit
        //   awaiting `Promise.all` here.
    
        /* await */ Promise.all(
          listOfResolvedPostData.map(postData => createPost(...postData))
        );
          }
        } else {
          console.log('No articles found.');
        }
      } catch (error) {
        console.error('Error while fetching news:', error.message);
      }
    }
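
    Note that if the await is really omitted, a rejection from the un-awaited Promise.all bypasses the surrounding try...catch and surfaces as an unhandled rejection. A minimal guard (just a sketch) is to handle errors on the promise chain itself:

    // without the `await`, rejections escape the `try...catch`,
    // so catch them on the chain directly.
    Promise.all(
      listOfResolvedPostData.map(postData => createPost(...postData))
    ).catch(error => console.error('Error while posting articles:', error.message));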
    