I have a piece of text of 4226 characters (316 words plus special characters).
I am trying different combinations of min_length and max_length to get a summary with the call:
print(summarizer(INPUT, max_length = 1000, min_length=500, do_sample=False))
The full code is:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
INPUT = """We see ChatGPT as an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. As ChatGPT stated, large language models can be put to work as a communication engine in a variety of applications across a number of vertical markets. Glaringly absent in its answer is the use of ChatGPT in search engines. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The use of a large language model enables more complex and more natural searches and extract deeper meaning and better context from source material. This is ultimately expected to deliver more robust and useful results. Is AI coming for your job? Every wave of new and disruptive technology has incited fears of mass job losses due to automation, and we are already seeing those fears expressed relative to AI generally and ChatGPT specifically. The year 1896, when Henry Ford rolled out his first automobile, was probably not a good year for buggy whip makers. When IBM introduced its first mainframe, the System/360, in 1964, office workers feared replacement by mechanical brains that never made mistakes, never called in sick, and never took vacations. There are certainly historical cases of job displacement due to new technology adoption, and ChatGPT may unseat some office workers or customer service reps. However, we think AI tools broadly will end up as part of the solution in an economy that has more job openings than available workers. However, economic history shows that technology of any sort (i.e., manufacturing technology, communications technology, information technology) ultimately makes productive workers more productive and is net additive to employment and economic growth. How big is the opportunity? The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data. We expect the market to grow by 20% CAGR to reach USD 90bn by 2025. Given the relatively early monetization stage of conversational AI, we estimate that the segment accounted for 10% of the broader AI’s addressable market in 2020, predominantly from enterprise and consumer subscriptions. That said, user adoption is rapidly rising. ChatGPT reached its first 1 million user milestone in a week, surpassing Instagram to become the quickest application to do so. Similarly, we see strong interest from enterprises to integrate conservational AI into their existing ecosystem. As a result, we believe conversational AI’s share in the broader AI’s addressable market can climb to 20% by 2025 (USD 18–20bn). Our estimate may prove to be conservative; they could be even higher if conversational AI improvements (in terms of computing power, machine learning, and deep learning capabilities), availability of talent, enterprise adoption, spending from governments, and incentives are stronger than expected. How to invest in AI? We see artificial intelligence as a horizontal technology that will have important use cases across a number of applications and industries. From a broader perspective, AI, along with big data and cybersecurity, forms what we call the ABCs of technology. We believe these three major foundational technologies are at inflection points and should see faster adoption over the next few years as enterprises and governments increase their focus and investments in these areas. Conservational AI is currently in its early stages of monetization and costs remain high as it is expensive to run. 
Instead of investing directly in such platforms, interested investors in the short term can consider semiconductor companies, and cloud-service providers that provides the infrastructure needed for generative AI to take off. In the medium to long term, companies can integrate generative AI to improve margins across industries and sectors, such as within healthcare and traditional manufacturing. Outside of public equities, investors can also consider opportunities in private equity (PE). We believe the tech sector is currently undergoing a new innovation cycle after 12–18 months of muted activity, which provides interesting and new opportunities that PE can capture through early-stage investments."""
print(summarizer(INPUT, max_length = 1000, min_length=500, do_sample=False))
Questions I have are:
Q1: What does the following warning message mean? Your max_length is set to 1000, ...
Your max_length is set to 1000, but your input_length is only 856. You might consider decreasing max_length manually, e.g. summarizer('…', max_length=428)
Q2: After the above message it publishes a summary of 2211 characters in total. How did it get that?
Q3: Of the above 2211 characters, the first 933 characters are valid content from the text, but then it publishes text like:
For confidential support call the Samaritans on 08457 90 90 90 or
visit a local Samaritans branch, see www.samaritans.org for details.
For support …
Q4: How do min_length and max_length actually work (it does not seem to follow the restrictions given to it)?
Q5: What is the max input that I can actually give to this summarizer?
Answers
Q2: After the above message it publishes a summary of 2211 characters in total. How did it get that?
A: The length that the model sees is not the number of characters, so Q2 is an out-of-scope question. It's more appropriate to check whether the output of the model is shorter than the input in terms of the number of subword tokens.
How we humans count words is kinda different from how the model counts tokens, i.e.
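A minimal sketch of that difference (assuming INPUT holds the text above, and using the tokenizer of the same checkpoint the pipeline loads):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

num_words = len(INPUT.split())                    # roughly how a human counts words
num_tokens = len(tokenizer(INPUT)["input_ids"])   # how the model counts subword tokens
print(num_words, num_tokens)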
[out]:
We see that the input text you have in the example has 800 input subword tokens, not 300 words.
Q1: What does the following mean?
Your max_length is set to 1000 ...
The warning message is as such:
Your max_length is set to 1000, but your input_length is only 856. You might consider decreasing max_length manually, e.g. summarizer('…', max_length=428)
Let's first try to put the input into the model and see the number of tokens it outputs (without pipeline):
[code]:
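Something along these lines (a sketch using the standard BART classes; no length arguments are passed, so generation falls back to the model's defaults):
from transformers import BartTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

inputs = tokenizer([INPUT], return_tensors="pt")
outputs = model.generate(inputs["input_ids"])     # no min_length/max_length overrides
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)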
[stderr]:
[stdout]:
Checking the output no. of tokens:
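For example, continuing from the sketch above:
print(len(outputs[0]))   # no. of subword tokens in the generated summary
print(len(summary))      # no. of characters in the decoded summary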
[out]:
Thus, the model summarizes the 800-subword-token input into an output of 73 subword tokens made up of 343 characters.
Not sure how you got an output of 2k+ chars though, so let's try with pipeline.
[code]:
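A sketch of the pipeline version (again with no length arguments, so the model defaults apply):
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer(INPUT, do_sample=False)
summary_text = result[0]["summary_text"]
print(summary_text)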
[out]:
Checking the size of the output:
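E.g., continuing from the pipeline sketch above:
print(len(summary_text))   # no. of characters in the pipeline summary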
[out]:
This is consistent with how we use the model without pipeline: a 343-character summary.
Q: Does that mean that I don't have to set the max_new_tokens?
Yeah, kind-of, you don't have to do anything since the summary is already shorter than the input text.
Q: What does setting the max_new_tokens do?
We know that the default output summary gives us 73 tokens. Let's try and see what happens if we set it down to 30 tokens!
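For example (reusing the model, tokenizer and inputs from the no-pipeline sketch above):
outputs = model.generate(inputs["input_ids"], max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))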
[stderr]:
Ah ha, there’s some minimum length that the model wants to output as the summary!
So let's just try to set it to 60:
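E.g. (same objects as before, just a different limit):
outputs = model.generate(inputs["input_ids"], max_new_tokens=60)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)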
[out]:
We see that now the summarized output is shorter than the 73-token default output and fits into the 60 max_new_tokens limit we set.
And if we check print(len(outputs[0])), we get 61 subword tokens; the additional one over max_new_tokens accounts for the end-of-sentence symbol. If you print the outputs, you'll see that the first token id is 2, which is represented by the </s> token. When you specify skip_special_tokens=True, it will delete the </s> token, as well as the start-of-sentence token <s>.
Q4: How do min_length and max_length actually work (it does not seem to follow the restrictions given to it)?
Given the above examples, the min_length is actually hard to determine, since the model has to decide the minimum number of subword tokens it needs to produce a good summary output. Remember the Unfeasible length constraints: the minimum length (56) ... warning?
Q5: What is the max input that I can actually give to this summarizer?
The sensible max_length, or more appropriately max_new_tokens, is most probably going to be lower than your input length, and if there are UI limitations or compute/latency limitations, it's best to keep it low and close to whatever is needed. I.e., to set max_new_tokens, just make sure it's lower than the input text's number of tokens and sensible enough for your application. If you want a ballpark number, try the model without setting the limit, see if the summary output is how you expect the model to behave, then adjust appropriately. Like seasoning while cooking: "Add/reduce max_new_tokens as desired."
Q3: Of the above 2211 characters, the first 933 characters are valid content from the text, but then it publishes text like …
When setting the min_length to some arbitrarily large number, way larger than the default output of the model, i.e. 73 subword tokens, for example:
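A sketch of this (again reusing the no-pipeline objects from above, with the same length arguments as in the question):
outputs = model.generate(inputs["input_ids"], min_length=500, max_length=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))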
Then it will warn you:
[stderr]:
It will start hallucinating things beyond the first 300-ish subword tokens. Possibly, the model thinks that beyond 300-ish subwords, nothing else from the input text is important.
And the output looks something like:
Q: Why did the model start hallucinating beyond 300 subwords?
Good question and also an active research area, see https://aclanthology.org/2022.naacl-main.387/ and there are many more in that area.
[Opinion]: Personally, my hunch says it's most probably because, in most of the data the model learnt from where the text is 800-ish subwords, the summaries it trained on are between 80 and 300 subwords long, and in the training data points where the summary runs to 300–500 subwords, it always contains the SOS helpline. So the model starts to overfit whenever it reaches a min_length that is >300.
To test the hunch (the proof is in the pudding), try another random text of 800-ish subwords, and then set the min_length again to 500; it will most probably hallucinate the SOS sentence again beyond 300-ish subwords.