I have a function in my JavaScript code that loops through an array and performs some time-consuming actions on each item. It works fine for now, when the number of items in the array is low, but I also want the code to work when the array is larger. Here is my function:
```js
const fetchAndProcessNews = async (queryString, from) => {
  const query = {
    queryString,
    from,
    size: 1,
  }
  try {
    console.log('Fetching news...')
    const { articles } = await searchApi.getNews(query)
    console.log('total articles fetched:', articles.length)
    console.log('Fetched news:', articles)
    if (articles && articles.length > 0) {
      console.log('Processing news...')
      // looping through all the articles fetched from the API
      for (const article of articles) {
        console.log('Processing article with name: ', article.title)
        const { title, sourceUrl, id, publishedAt } = article
        // scraping content from the source URL and returning the markup of the single article
        const markup = await scraper(sourceUrl)
        // using GPT to perform some tasks on the markup returned from scraping
        const data = await askGpt(markup)
        // using DALL·E to generate an image
        const generatedImageUrl = await generateImg(data?.imageDescription)
        // downloading the image from the URL and uploading it to S3
        const s3ImageUrl = await generateImgUrl(generatedImageUrl, title, id)
        // uploading the article to Strapi using a POST request
        const newTitle = data?.title
        const newMarkup = data?.content
        const description = data?.abstract
        const categories = data?.categories
        console.log('pushing article to strapi')
        await createPost(
          newTitle,
          description,
          newMarkup,
          s3ImageUrl,
          publishedAt,
          categories
        )
        console.log('article processing completed...')
      }
    } else {
      console.log('No articles found')
    }
  } catch (error) {
    console.error('Error fetching news:', error.message)
  }
}
```
Let me explain what I am doing: I am fetching some news articles from an API, and for every article I perform these tasks:
- scrape the content from a URL provided by the API using Cheerio, which takes some time
- use OpenAI to perform some tasks on the markup, which also takes a lot of time
- generate an image using DALL·E, which also takes time
- then I upload the image to S3
- then I upload everything to Strapi using a POST request

Now I am worried: if the number of articles is, let's say, 100 or 1000, how will this code work? Will it be able to handle all of these time-consuming tasks? How can I make it more optimal so that it does not crash and works normally? I don't have that much experience, which is why I am a little bit worried. What techniques should I use? Should I use some kind of queue like Bull, or batch processing? Please, if somebody can provide a detailed answer, it will be a big help.
2 Answers
The idea is to not wait for all steps of one article's processing to complete before starting the processing of the next article.

You could, for instance, start a next article at each passing of 100 milliseconds. You could even start them all without delay, but the risk is that you may make too many requests to the same server, hitting some server limit. So it is probably more prudent to have a slight delay between the initiations of article processing. To get such an intermediate delay, you could use this general-purpose function:
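```js
// Resolves after the given number of milliseconds, so a caller can
// `await delay(100)` between kicking off article tasks.
// (The name `delay` is one possible choice; name it whatever fits your codebase.)
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
```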
The easiest way to accomplish the overall idea is to put the code for article processing in a separate function:
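```js
const processArticle = async (article) => {
  console.log('Processing article with name: ', article.title)
  const { title, sourceUrl, id, publishedAt } = article
  // scraping content from the source URL and returning the markup of the single article
  const markup = await scraper(sourceUrl)
  // using GPT to perform some tasks on the markup returned from scraping
  const data = await askGpt(markup)
  // using DALL·E to generate an image
  const generatedImageUrl = await generateImg(data?.imageDescription)
  // downloading the image from the URL and uploading it to S3
  const s3ImageUrl = await generateImgUrl(generatedImageUrl, title, id)
  // uploading the article to Strapi using a POST request
  const newTitle = data?.title
  const newMarkup = data?.content
  const description = data?.abstract
  const categories = data?.categories
  console.log('pushing article to strapi')
  await createPost(
    newTitle,
    description,
    newMarkup,
    s3ImageUrl,
    publishedAt,
    categories
  )
  console.log('article processing completed...')
}
```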
No code was changed here; it was just moved into a function.
Now your main function can execute the above function without awaiting it. Instead, it can capture the promise it returns (which will be pending) and collect such promises in an array. This means several requests will now be made for different articles without awaiting their responses. Finally, you will probably want to await that all these promises have settled.
Here is how your original function would look:
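```js
const fetchAndProcessNews = async (queryString, from) => {
  const query = {
    queryString,
    from,
    size: 1,
  }
  try {
    console.log('Fetching news...')
    const { articles } = await searchApi.getNews(query)
    console.log('total articles fetched:', articles.length)
    console.log('Fetched news:', articles)
    if (articles && articles.length > 0) {
      console.log('Processing news...')
      const promises = []
      for (const article of articles) {
        // start processing this article, but do NOT await it here;
        // just capture the (pending) promise
        promises.push(processArticle(article))
        // slight pause before initiating the next article
        await delay(100)
      }
      // wait until every started task has either fulfilled or rejected
      await Promise.allSettled(promises)
    } else {
      console.log('No articles found')
    }
  } catch (error) {
    console.error('Error fetching news:', error.message)
  }
}
```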
The `await Promise.allSettled` operation is optional, but it will be useful for the caller of `fetchAndProcessNews`, as then the promise they get in return will only resolve when all is done.

On a final note, you'll probably want to improve the `console.log` output made in `processArticle`, as now the messages will get interwoven, and it will be useful to see which article the messages "pushing article to strapi" and "article processing completed..." are about.

One might try both ...
- separating the asynchronous/deferred post-data calculation of the to-be-scraped articles from uploading the calculated post-data, and
- parallelizing post-data calculations and post api-calls alike.
A possible solution first implements a function which calculates the post-data of any to-be-scraped article.
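A minimal sketch of such a function, reusing the helpers from the question (the name `calculatePostData` is only an illustrative choice), could resolve with the exact argument list that `createPost` expects:

```js
// calculates the post-data of a single to-be-scraped article; it performs
// every deferred step except the final upload and resolves with the
// argument list that `createPost` expects
const calculatePostData = async (article) => {
  const { title, sourceUrl, id, publishedAt } = article
  const markup = await scraper(sourceUrl)
  const data = await askGpt(markup)
  const generatedImageUrl = await generateImg(data?.imageDescription)
  const s3ImageUrl = await generateImgUrl(generatedImageUrl, title, id)
  return [
    data?.title,
    data?.abstract,
    data?.content,
    s3ImageUrl,
    publishedAt,
    data?.categories,
  ]
}
```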
A convenient approach for parallelizing post-data calculations while also trying to not entirely overload/block the API might come with the combined usage of an async generator function, the returned async generator and `Promise.all`.

One does create batches of parallel deferred post-data calculations by `splice`ing `N` article items at a time from an array of to-be-scraped articles (the ones initially derived from the first api-call).

As long as there are article items within the constantly mutated article array, the async generator yields a promise of all `N` asynchronous/deferred post-data calculations, where `N` is the batch size.

Iterating the async generator with the `for await...of` statement gives access to an array of resolved post-data. Thus, one now can try parallelizing the post api-calls as well by creating a promise of all api-calls, each having gotten passed its related `args`-array of resolved post-data.
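A sketch of how everything could be wired together, assuming the `calculatePostData` helper from above and an arbitrarily chosen batch size of 5:

```js
// async generator which keeps splicing the next N articles from the
// passed array and yields a promise of all N deferred calculations
async function* createBatchesOfDeferredPostData(articles, batchSize) {
  while (articles.length > 0) {
    yield Promise.all(articles.splice(0, batchSize).map(calculatePostData))
  }
}

const fetchAndProcessNews = async (queryString, from) => {
  const query = { queryString, from, size: 1 }
  try {
    const { articles } = await searchApi.getNews(query)
    if (articles && articles.length > 0) {
      // each iteration awaits one batch of N parallel post-data calculations
      for await (const batchOfPostData of createBatchesOfDeferredPostData(
        [...articles], // pass a copy, since the generator mutates the array
        5
      )) {
        // parallelize the post api-calls of the current batch as well,
        // each call getting passed its related args-array of post-data
        await Promise.all(batchOfPostData.map((args) => createPost(...args)))
      }
    } else {
      console.log('No articles found')
    }
  } catch (error) {
    console.error('Error fetching news:', error.message)
  }
}
```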