
Emerging trends: General fine-tuning (gft)

Published online by Cambridge University Press:  23 May 2022

Kenneth Ward Church*
Affiliation:
Baidu, Sunnyvale, CA, USA
Xingyu Cai
Affiliation:
Baidu, Sunnyvale, CA, USA
Yibiao Ying
Affiliation:
Baidu, Beijing, China
Zeyu Chen
Affiliation:
Baidu, Beijing, China
Guangxu Xun
Affiliation:
Baidu, Sunnyvale, CA, USA
Yuchen Bian
Affiliation:
Baidu, Sunnyvale, CA, USA
*Corresponding author. E-mail: [email protected]

Abstract

This paper describes gft (general fine-tuning), a little language for deep nets, introduced at an ACL-2022 tutorial. gft makes deep nets accessible to a broad audience including non-programmers. It is standard practice in many fields to use statistics packages such as R. One should not need to know how to program in order to fit a regression or classification model and to use the model to make predictions for novel inputs. With gft, fine-tuning and inference are similar to fit and predict in regression and classification. gft demystifies deep nets; no one would suggest that regression-like methods are “intelligent.”

Type
Emerging Trends
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press

1. Introduction

This paper introduces gft (general fine-tuning),Footnote 1 a little languageFootnote 2 for deep nets, introduced at an ACL-2022 tutorial.Footnote 3 There are two parts to the tutorial:

  1. Glass is half-full: make deep nets accessible to a mass audience, including nonprogrammers, and

  2. Glass is half-empty: based on the successes of the first part on so many benchmarks, one might come to the mistaken impression that deep nets are more successful than they are. There are always opportunities for improvement. We are advocating an interdisciplinary approach that combines the successes in the first part with decades of work in AI representation and centuries of work in linguistics and philosophy.

This paper will use gft to discuss the first part. It is amazing how much can be done with so little. gft demystifies deep nets. No one would suggest that regression-like methods are “intelligent.”

There are two main functions in gft: fit and predict. Fit takes a pretrained model, $f_{pre}$ , as input, and fine-tunes that on data to produce a post-trained model, $f_{post}$ , as output. Predict takes x, a novel input, and predicts, $\hat{y}=f(x)$ . Hopefully, the prediction, $\hat{y}$ , will be close to the gold label, y.

We discussed deep nets in two previous articles in this journal (Church et al., 2021a, b). gft makes it possible to do much of that in short (1-line) programs. 1-line programs are easier to read, write, understand, and port from one environment to another than examples on hubs (typically hundreds of lines of Python, PyTorch,Footnote 4 TensorFlow,Footnote 5 Jax,Footnote 6 and/or PaddlePaddle).Footnote 7

gft is designed to make much of this functionality accessible to nonprogrammers. Just as one does not need to know Python and Machine Learning to use an off-the-shelf regression package, so too, deep nets should not require much (if any) programming skills.

Following the advice in “Crossing the Chasm” (Moore and McKenna, 1999), the long-term success of deep nets will depend on finding ways to cross the chasm from the current set of loyal users (so-called early adopters) to a much larger set of users. Early adopters may be willing to invest in machine learning and programming, but most users have other priorities.

The gft interpreter is based on examples from hubs.Footnote 8 , Footnote 9 Hubs encourage users to modify hundreds of lines of Python code as necessary if they want to change models, data sets, and/or tasks. gft generalizes the examples so users can do much of that in a single line of gft code (with comparable performance).

gft supports most of the arguments in the examples on the hubs, so it is possible to tune hyper-parameters such as batch size, learning rate, and stopping rules. Tuning matters for chasing the state of the art (SOTA-chasing), though default settings are recommended for most users, who prefer results that are easy to replicate and reasonably competitive.

There is already too much SOTA-chasing in the literature (Church and Kordoni, 2022). Users should avoid wasting time on hyper-parameter tuning unless they are about to ship a model to a large number of users for an application where small improvements in performance are worth the effort.

2. gft Cheatsheet

gft supports the following functions:Footnote 10

  1. fit (also known as fine-tuning): $f_{pre} + data \rightarrow f_{post}$

  2. predict (also known as inference): $f(x) = \hat{y}$ , where x is an input from stdin or from a data set

  3. eval: $f + data \rightarrow score$ (produce a single score for a data set split, as opposed to a prediction, $\hat{y}$ , for each input row in the split, x)

  4. summary: Search hubs for popular data sets, models, and tasks and provide snippets. Popularity is estimated from metrics on downloads.

  5. cat_data: Output data set on stdout

There are four major arguments:

  1. --data: a data set on a hub, or a local file

  2. --model: a model on a hub, or a local file

  3. --task: for example, classify, regressFootnote 11

  4. --eqn (e.g., classify: $y \sim x_1 + x_2$ ), where a task appears before the colon, and variables refer to columns in the data set.
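The structure of the eqn argument can be made concrete with a small parser. The following Python sketch is an illustration only and is not taken from the gft source: the function name parse_eqn is hypothetical, and it assumes nothing beyond what is stated above (a task before the colon, then column names separated by ~ and +).

```python
def parse_eqn(eqn):
    """Split 'task: lhs ~ rhs' into (task, lhs_columns, rhs_columns).

    Hypothetical helper for illustration; gft's own parser may differ.
    """
    task, colon, formula = eqn.partition(":")
    if not colon:
        raise ValueError("expected a task before a colon, e.g. 'classify: y ~ x1'")
    lhs, tilde, rhs = formula.partition("~")
    if not tilde:
        raise ValueError("expected '~' between the two sides of the formula")

    def columns(side):
        # Column names on one side of the formula are joined by '+'.
        return [name.strip() for name in side.split("+") if name.strip()]

    return task.strip(), columns(lhs), columns(rhs)


task, lhs, rhs = parse_eqn("classify: y ~ x1 + x2")
print(task, lhs, rhs)
```

In this example, the task is classify, the left-hand side is the column y, and the right-hand side consists of the columns x1 and x2.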

3. The standard recipe

Following Howard and Ruder (2018) and Devlin et al. (2019), it has become standard practice to use the 3-step recipe in Table 1. We prefer the terms fit and predict to fine-tuning and inference. The proposed terminology has a long tradition in statistics and predates relatively recent work on deep nets.Footnote 12

Table 1. The standard recipe consists of three steps

Fit and predict were discussed in two previous Emerging Trends articles in this journal (Church et al., 2021a, b). This paper will unify much of that discussion into a single github (see footnote 1) with hundreds of examples of short (1-line) programs.Footnote 13

gft makes it easy to use models and data sets on hubs (HuggingFaceFootnote 14 and PaddleHub/PaddleNLP).Footnote 15 The hubs are large ( $\sim$ 40k models and $\sim$ 4k data sets) and growing quickly ( $\sim$ 3x/year). The challenge is to make these amazing resources more accessible to as many users as possible. The target audience has diverse interests and skills. It should not be necessary for them to know much (if any) programming to join in on the fun.

The 40k models include both pretrained and post-trained models, $f_{pre}$ and $f_{post}$ . gft provides tools to make it easy to find popular models, as well as popular data sets. We recommend users make as much use as possible of these resources and resist the temptation to pretrain their own models from scratch, for reasons that will be discussed in Appendix A.1.

3.1. An example of fit and predict in R

As mentioned above, gft is inspired by glm (generalized linear models) (Guisan et al., 2002) in R.Footnote 16 Listing 1 illustrates the use of fit and predict in R. The R environment provides a number of standard data sets such as cars, a data table with two columns, speed and dist, shown as black points in Figure 1. The model, g, fits dist as a quadratic function of speed. Predictions from this model are shown in red in Figure 1.

Listing 1. Example of fit and predict in R.

Figure 1. Results produced by Listing 1. Predictions from the model g are shown in red.

The summary function in R is applied to both the data table cars as well as the model g. The R summary function can be applied to almost any object and provides some useful description of its argument.
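For readers who do not use R, the same fit-then-predict pattern can be sketched in Python. This is an illustration under stated assumptions, not a translation of Listing 1: the data points are made up (they are not the cars data set), the helper names fit_quadratic and predict are hypothetical, and the fit uses ordinary least squares via the normal equations.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2; returns [a, b, c]."""
    # Normal equations: (X^T X) beta = X^T y, where each row of X is [1, x, x^2].
    rows = [[1.0, x, x * x] for x in xs]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    M = [A[i] + [b[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(M[k][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for k in range(col + 1, 3):
            factor = M[k][col] / M[col][col]
            M[k] = [v - factor * p for v, p in zip(M[k], M[col])]
    # Back substitution.
    beta = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (M[i][3] - tail) / M[i][i]
    return beta


def predict(beta, x):
    """Apply the fitted model to a novel input x."""
    a, b, c = beta
    return a + b * x + c * x * x


# Made-up data that lie exactly on y = 1 + 2x + 3x^2 (not the cars data set).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 6.0, 17.0, 34.0, 57.0]

g = fit_quadratic(xs, ys)   # fit: data -> model
print(predict(g, 5.0))      # predict: model + novel input -> y-hat
```

The division of labor is the point of the sketch: fit consumes a data table and returns a model, and predict applies that model to an input that was not in the table.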

3.2. An example of fit (aka fine-tuning)

Listing 2 shows an example of gft_fit. Listing 2 is similar to Listing 1 in a number of ways. Fit takes a pretrained model, $f_{pre}$ , and uses a data set to output a post-trained model, $f_{post}$ . In Listing 2, $f_{pre}$ is a BERT model, and the data table is the emotion data set on HuggingFace. The model in Listing 1, g, is analogous to $f_{post}=\$outdir$ in Listing 2. The variables in both equations, line 7 of Listing 1 and line 3 of Listing 2, refer to columns in the relevant data table.

Listing 2. Example of gft_fit.

Many gft programs take four arguments:

  1. --data specifies the use of the emotion data set on HuggingFace.Footnote 17

  2. --model specifies the use of a BERT model on HuggingFaceFootnote 18 as $f_{pre}$ .

  3. --eqn specifies a task (classification), plus a formula expressed in terms of columns in the data set.

  4. --task specifies a task (not necessary when the task is specified by the --eqn argument).

Fit takes most of these (except for --task); in addition, fit requires --output_dir to specify a location for the output post-trained model, $f_{post}$ .

3.3. An example of predict (aka inference)

Listing 3. Example of gft_predict. The default model performs sentiment analysis.

Listing 4. Example of gft_predict with a model for emotion classification.

Listings 3 and 4 show two examples of gft_predict. Predict takes a novel input, x, and applies x to a model, f, to produce a prediction, $\hat{y}=f(x)$ . The default model (for the classification task) performs sentiment analysis; other models output other labels. In particular, the f in Listing 4 outputs emotion classes: anger, fear, joy, love, sadness, surprise. To see the set of classes for a model, we recommend the use of gft_summary, as illustrated in Listing 5. gft_summary outputs the set of classes, among other things.

Listing 5. Example of gft_summary.

Some more classifications of $x=$ I love you are shown in Tables 2 and 3 using a number of different models from HuggingFace. Most of these models agree that x is positive, though many of them classify x as fake news and some classify x as spam. One can use other models to classify x in many ways such as offensive or not and hate speech or not.

Table 2. Sentiment classification of $x=$ I love you

Table 3. More classifications of $x=$ I love you

Many of these classifiers were trained on corpora that may not be appropriate for this task. In particular, a Spanish classifier should not be applied to English inputs, but such mistakes are likely to happen, given how easy they are to make.

Most of the models on the hubs were created by the community. The hubs do not vet models for quality. The best models on the hubs are very good, though maybe not state of the art (SOTA). We rarely see results that are as good as PWCFootnote 19 and leaderboards.Footnote 20 Some models produce poor results, or no results (using standard mechanisms in gft). The most popular models (in terms of downloads) often produce competitive results, though the most popular models rarely produce the best results.

3.4. Embarrassment of riches

As mentioned at the beginning of this section, there are a huge number of models and data sets on the hubs. There are currently 40k models and 4k data sets, and these numbers are increasing rapidly ( $\sim$ 3x/year). How do we find the good stuff? And how do we use it?

The hubs provide a number of useful tools to answer these questions. There are GUI interfaces (as illustrated by footnotes 17 and 18), as well as APIs. gft_summary uses the APIs to provide much of this functionality, as illustrated in Listing 6, which finds the five most popular data sets (or models) that contain the substring: “emotion.” Popularity is estimated from downloads.

Listing 6. Example of gft_summary as a search engine.

Listing 7 finds the most popular data sets and models by searching for data sets and models that contain the null string:

Listing 7. Example of gft_summary with the null string as a query.
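The search behavior described above (substring match, then rank by downloads) can be sketched in a few lines of Python. This is a minimal sketch and not how gft_summary is implemented: the function name top_by_downloads is hypothetical, and the records are made up rather than fetched from a hub API.

```python
def top_by_downloads(records, query, n=5):
    """Return the n most-downloaded records whose name contains query."""
    hits = [r for r in records if query in r["name"]]
    return sorted(hits, key=lambda r: r["downloads"], reverse=True)[:n]


# Made-up records for illustration; names and counts are not real hub data.
records = [
    {"name": "example/emotion-small", "downloads": 120},
    {"name": "example/sentiment", "downloads": 900},
    {"name": "example/emotion-large", "downloads": 450},
]

print([r["name"] for r in top_by_downloads(records, "emotion")])
print([r["name"] for r in top_by_downloads(records, "", n=1)])
```

Because the empty string is a substring of every name, an empty query ranks all records, which mirrors the null-string query in Listing 7.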

There are a few common naming conventions. Models containing the string “base” are likely to be base models, $f_{pre}$ (also known as pretrained models or foundation models). Models containing the string “distil” are likely to be distilled (compressed) models. Models containing the names of popular tasks such as “squad” and GLUE subtasks are likely to be post-trained models, $f_{post}$ .
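These conventions amount to a crude substring heuristic, sketched below in Python. The function name guess_model_kind is hypothetical, the list of task names is a small sample chosen for illustration, and letting a task name take precedence over “base” is our own design choice rather than a rule stated by the hubs. Substring matching of this kind will misfire on some names, so the output is a guess, not a guarantee.

```python
# A small, illustrative sample of task names; not an exhaustive list.
TASK_NAMES = ("squad", "cola", "sst2", "mrpc", "qnli", "mnli")


def guess_model_kind(name):
    """Guess whether a model is base, distilled, and/or post-trained from its name."""
    lowered = name.lower()
    kinds = []
    if "distil" in lowered:
        kinds.append("distilled")
    if any(task in lowered for task in TASK_NAMES):
        # Assumption: a task name signals fine-tuning, even if "base" also appears.
        kinds.append("post-trained")
    elif "base" in lowered:
        kinds.append("base")
    return kinds or ["unknown"]


for name in ["bert-base-cased", "distilbert-base-uncased", "example-finetuned-squad"]:
    print(name, guess_model_kind(name))
```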

gft_summary can also be used to summarize data sets, models, tasks, etc. As mentioned in Section 3.1, these summaries are modeled after the summary function in R, which takes many different types of objects and produces useful descriptions.

Listing 8. Examples of gft_summary.

3.5. Portability across hubs and frameworks

3.5.1. Portability $\rightarrow$ stability over time

The code in the listings above takes a dependency on HuggingFace, a small start-up company that has done very well recently. There are also dependencies on a number of Python packages that are constantly changing. We have seen many hardware and software platforms come and go. Many companies do well for a while, but success rarely lasts for long (decades). Deep nets will be more likely to survive the test of time if they are written in high-level languages such as gft that can be ported from one environment to another, as necessary.

Consider the example of operating systems. Unix survived the test of time better than alternatives such as VMSFootnote 21 because Unix was designed to port easily across suppliers. There was a time when Unix was mostly running on DEC machines,Footnote 22 and then there was a time when Unix was mostly running on Sun computers.Footnote 23 These days, Unix has moved on to other platforms. If programs are written in a relatively stable higher level environment like Unix (and gft), then old programs are more likely to continue to work for decades, despite instabilities at lower levels in the hardware and software stacks.

Too many deep nets are taking dependencies on Python packages that are updated very frequently (almost daily), often in incompatible ways. Many of these resources are supported by companies that could go out of business, or could decide to sunset support at any time. Given recent events, there is a risk that support could also be cut off by sanctions and other instabilities in international relations. Because of these realities, gft is designed to make it easy to port from one hub to another.

3.5.2. H is for HuggingFace and P is for PaddleNLP/PaddleHub

Listing 9 is similar to Listing 2, though dependencies on one company (H $\rightarrow$ HuggingFace) are replaced by dependencies on another company (P $\rightarrow$ Baidu’s PaddleNLP/PaddleHub). gft supports mixing and matching models and data sets from different suppliers. “H:” uses resources from HuggingFace, and “P:” uses resources from PaddleNLP/PaddleHub. gft also supports “C:” for custom resources on the local file system.

Listing 9. An example of gft_fit using P for PaddleNLP/PaddleHub.

Note that most of the models on HuggingFace are based on PyTorch, whereas models on PaddleNLP and PaddleHub use a different framework called PaddlePaddle. gft hides much of this complexity.

Listing 9 uses the chnsenticorp data set,Footnote 24 which is different from the emotion data set in Listing 2. The chnsenticorp data set specifies a sentiment analysis task in Chinese, whereas the emotion data set specifies an emotion classification task in English.

Listing 9 uses the ernie-tiny model (Su et al., 2021), a compressed version of an ERNIE model. ERNIE models are similar to BERT models, though ERNIE models may be more appropriate for Chinese applications. Distillation (Hinton et al., 2015) is a popular method to compress models. Compressed models tend to trade off a little bit of performance (accuracy) in order to save a substantial amount of space and time when making predictions at inference time (Ganesh et al., 2021). Distillation can be important for commercial applications.

4. Data sets and equations

4.1. Data sets

As mentioned in Section 3.4, there are currently more than 4000 data sets on the hubs. We have already mentioned the emotion data set. Many data sets provide splits for training, validation, and test, though different data sets may name these splits differently. Each split provides a data table with columns and rows. The emotion data set, for example, contains two columns, named text and label. As can be seen in HuggingFace’s data set viewer,Footnote 25 each row specifies a text field (e.g., “i didnt feel humiliated”) and a label field (e.g., “sadness”). We will refer to the label field as a gold label. The task is to predict the gold labels.

SQuADFootnote 26, Footnote 27 (Rajpurkar et al., 2016, 2018) is a popular data set for question answering. This data set has 5 columns: id, title, context, question, and answers. The answers are substrings of the context, which makes this task considerably easier than the general case of Q&A (question answering), where the answer could be almost anything, and need not be mentioned in any of the other columns.

Section 2.1 of Church and Kordoni (2022) discusses constructed queries like those in SQuAD. The TREC QA trackFootnote 28 started with “constructed” questions in 1999, but quickly moved to “real” questions from query logs for subsequent TREC QA tracks (2000–2007) because constructed questions are too easy for systems and unrealistic (Voorhees, 2001).

Another popular data set is GLUEFootnote 29, Footnote 30 (Wang et al., 2018). GLUE contains a number of subsets: cola, sst2, wnli, mrpc, rte, qnli, qqp, stsb, mnli. Each subset contains 3 splits (train, validation, test). Different subsets have different columns.

GLUE has been updated with another task, SUPERGLUE (Wang et al., 2019). Both GLUE and SUPERGLUE are popular on HuggingFace (in terms of downloads), though there are currently more downloads for GLUE.Footnote 31

4.2. Examples of --data and --eqn

Short (1-line) gft programs can fit (fine-tune) many benchmarks, as illustrated in Table 4. Table 4 shows --data and --eqn arguments for a number of popular benchmarks.

  • --data arguments start with a supplier, for example, H, P, C. After the colon, there can be one or two substrings, delimited by a comma. For example, for the cola subtask of GLUE, the --data argument is H:glue,cola.

  • --eqn arguments consist of a task, plus a formula expressed in terms of columns in the data set. See Table 5 for some examples of tasks. For a more comprehensive list of tasks, see footnote 11.
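The syntax of the data argument described in the first bullet can be sketched as a small Python parser. This is an illustration, not gft's implementation: the function name parse_data_arg is hypothetical, and it assumes only the rules stated above (a supplier letter, a colon, then one or two comma-delimited substrings).

```python
# Supplier prefixes as described in the text.
SUPPLIERS = {"H": "HuggingFace", "P": "PaddleNLP/PaddleHub", "C": "custom (local files)"}


def parse_data_arg(arg):
    """Split 'H:glue,cola' into (supplier, name, subset); subset may be None."""
    supplier, colon, rest = arg.partition(":")
    if not colon or supplier not in SUPPLIERS:
        raise ValueError("expected a supplier prefix such as 'H:', 'P:' or 'C:'")
    parts = rest.split(",")
    if len(parts) > 2 or not all(parts):
        raise ValueError("expected one or two substrings after the colon")
    subset = parts[1] if len(parts) == 2 else None
    return supplier, parts[0], subset


print(parse_data_arg("H:glue,cola"))
print(parse_data_arg("H:emotion"))
```

The second substring, when present, selects a subset of a larger benchmark, as in the cola subtask of GLUE.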

Table 4. Fine-tuning for downstream tasks: GLUE, SQuAD, etc.

Table 5. Some examples of tasks

5. More examples and more tasks

As mentioned in footnote 13, there are hundreds of examples of gft in the github: fit,Footnote 32 predict,Footnote 33 summary,Footnote 34 and eval.Footnote 35 A few examples have already been discussed in Sections 3.2 and 3.3. Many more will be discussed in the next few subsections:

  1. Predict (Section 5.1): token-classification, fill-mask, MT, ASR, etc.

  2. Input from data sets (as opposed to stdin) (Section 5.2).

  3. gft_predict $\rightarrow$ gft_eval (Section 5.3).

5.1. Predict

A few examples of predict were shown in Listings 3 and 4. The gft documentation has many more examples of predict.Footnote 36

5.1.1. Token classification

An example of token classification is shown in Listing 10; some examples with PaddleNLP are shown in Listing 11.

Listing 10. Example of token classification.

Listing 11. Example of token classification with PaddleNLP.

Many of these tasks have been in the literature for a long time. Fill-mask is similar to the cloze task (Taylor, 1953), as illustrated in Listing 12.

Listing 12. Example of fill-mask (also known as cloze task).

Text generation is one of the more popular use cases for GPT-3, though Listing 13 uses a different model.

Listing 13. Example of text generation.

5.1.2. MT, ASR and more

There are translation models for many language pairs, as illustrated in Listing 14.Footnote 37

Listing 14. Example of machine translation (MT).

Listing 15. Example of automatic speech recognition (ASR).

Listing 16. Example of image classification.

5.2. Input from data sets (as opposed to stdin)

Listing 17 shows an example of input from a data set.

Listing 17. Example of input from data set (as opposed to stdin).

5.3. gft_predict $\rightarrow$ gft_eval

Listing 18 illustrates gft_eval.

Listing 18. gft_eval outputs a single score for a data set, as opposed to gft_predict, which outputs a prediction for each row.
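The contrast between predict (one output per row) and eval (one score per split) can be sketched in Python. The sketch below rests on assumptions: the function names are hypothetical, accuracy is used as the score purely for illustration, the rows are made up, and toy_model is a trivial stand-in for a post-trained model f.

```python
def predict_rows(f, xs):
    """Predict: one prediction y-hat for each input row x."""
    return [f(x) for x in xs]


def evaluate(f, xs, ys):
    """Eval: a single score for the whole split (here, accuracy)."""
    predictions = predict_rows(f, xs)
    correct = sum(p == y for p, y in zip(predictions, ys))
    return correct / len(ys)


def toy_model(text):
    # Stand-in for a real model; a keyword rule, not a deep net.
    return "joy" if "love" in text else "sadness"


# Made-up rows for illustration.
xs = ["I love you", "i feel low", "love is gone"]
ys = ["joy", "sadness", "sadness"]

print(predict_rows(toy_model, xs))   # one label per row
print(evaluate(toy_model, xs, ys))   # one number for the split
```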

5.4. Debugging, confusion matrices, and error analysis

In addition to producing a score with gft_eval, suppose we want to do some deep dives to look at particular errors. The code in Listing 19 will create a confusion matrix based on the validation split.

Listing 19. Code to create confusion matrix.

gft_predict outputs TSV (tab separated values) with 4 columns:

  1. Input, x

  2. Gold label, y

  3. Predicted label, $\hat{y}$

  4. Score

The cut statement on line 4 in Listing 19 selects y and $\hat{y}$ . The sort and uniq statements count the number of confusions, producing the confusion matrix shown in Table 6. Standard Unix tools such as grep (or AWK) can be used to find more details for particular confusions.
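The same counting step can be sketched in Python for readers less familiar with Unix pipelines. This is not the code in Listing 19: the function name confusion_counts is hypothetical, the rows are made up, and the only assumption about the input is the 4-column TSV layout described above (input, gold label, predicted label, score).

```python
from collections import Counter


def confusion_counts(tsv_lines):
    """Count (gold, predicted) pairs from 4-column TSV rows."""
    counts = Counter()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        gold, predicted = fields[1], fields[2]
        counts[(gold, predicted)] += 1
    return counts


# Made-up rows for illustration: input, gold label, predicted label, score.
rows = [
    "i feel fine\tjoy\tjoy\t0.98",
    "i am lost\tsadness\tfear\t0.51",
    "so happy\tjoy\tjoy\t0.91",
]

counts = confusion_counts(rows)
for (gold, predicted), n in sorted(counts.items()):
    print(n, gold, predicted)
```

Each printed line is one cell of the confusion matrix; off-diagonal cells (gold differs from predicted) are the natural starting points for error analysis.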

Table 6. Confusion matrix from Listing 19

Table 7. Some gold labels and predictions from model, $f_{post}$ , from Listing 20

5.5. Vectors on the left-hand side (lhs)

With regression and classification, the left-hand side (lhs) of the equation is typically a scalar, but gft has been generalized so the lhs can also be a point in a vector space, as shown in Listing 20. This example fine-tunes BERT with the NRC-VAD lexiconFootnote 38 (Mohammad, 2018). Words are assigned to points in $\mathbb{R}^3$ (Valence, Arousal, and Dominance), based on VAD norms in psychology (Osgood et al., 1957).

Listing 20. An equation with a vector on the left-hand side (lhs).

Listing 20 is our first example of a custom data set. There are three CSV files on the local filesystem:

  1. train split: $gft/datasets/VAD/VAD.train

  2. validation split: $gft/datasets/VAD/VAD.val

  3. test split: $gft/datasets/VAD/VAD.test

The three CSV files start with a header row that specifies the names of the columns. The variables in the equation refer to these columns in the CSV files.

In addition to illustrating the use of custom data sets, Listing 20 introduces two new features. First, we normally train models on corpora, but Listing 20 trains a model on a lexicon, the NRC-VAD lexicon. Second, regression usually takes scalar values on the left-hand side (lhs), but in this case, the lhs is a point in $\mathbb{R}^3$ .
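To make the idea of a vector-valued left-hand side concrete, the Python sketch below reads a CSV table with a header row and pairs each input with a point in $\mathbb{R}^3$. Everything specific here is an assumption for illustration: the column names (Word, Valence, Arousal, Dominance), the function name rows_to_examples, and the numeric values, which are made up and are not taken from the NRC-VAD lexicon.

```python
import csv
import io


def rows_to_examples(csv_text, lhs_columns, rhs_column):
    """Pair each input (rhs column) with a vector of targets (lhs columns)."""
    reader = csv.DictReader(io.StringIO(csv_text))  # header row names the columns
    examples = []
    for row in reader:
        x = row[rhs_column]
        y = tuple(float(row[name]) for name in lhs_columns)
        examples.append((x, y))
    return examples


# Made-up table; column names and values are hypothetical.
csv_text = (
    "Word,Valence,Arousal,Dominance\n"
    "happy,0.9,0.7,0.8\n"
    "gloomy,0.2,0.3,0.3\n"
)

examples = rows_to_examples(csv_text, ["Valence", "Arousal", "Dominance"], "Word")
print(examples)
```

With a scalar left-hand side, lhs_columns would contain a single name; listing three names yields a three-dimensional target for each input.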

Listing 20 produces a post-trained model $f_{post}$ . A few results with $f_{post}$ are shown in Table 7. This table shows some predictions, $\hat{y}$ , for some inputs, x, using $f_{post}$ . These predictions, $\hat{y}$ , can be compared with gold labels, y, VAD scores from NRC-VAD (last three columns).

Although the model was trained on words (lemmas in the NRC-VAD lexicon), the inputs, x, in Table 7 include a number of words, phrases, and texts, many of which are not in the NRC-VAD lexicon (by construction). That is, $f_{post}$ can be applied to any input text (up to 512 subword units). Table 7 shows predictions, $\hat{V}$ , $\hat{A}$ , and $\hat{D}$ , as well as gold values, V, A, and D. When the input, x, is not in the NRC-VAD lexicon, the gold value, y, is NA (not available). Since NRC-VAD is based on lemmas, NAs are to be expected for inflected forms, OOV (out-of-vocabulary) words such as unlovable, MWEs (multiword expressions) such as ugly duckling, sentences, and documents.

6. Conclusions

This paper proposed gft, a little language for fine-tuning pretrained base (foundation) models. Little languages make it easier for a broader audience (including non-programmers) to join in on the fun. Just as most users of regression do not need to know how to solve the regression optimization, so too users of deep nets should not need to understand hundreds of lines of Python and PyTorch. Higher level environments offer a number of advantages: ease of use, transparency, portability. gft removes much of the complexity, and much of the magic (and the alchemy) in deep nets, reducing fine-tuning to an optimization similar to regression. No one would suggest that regression-like methods are “intelligent.”

A Appendix

A.1 Pretraining ( $f_{pre}$ ): Don’t do it (yourself)

Recent work on foundation modelsFootnote 39 (Bommasani et al., 2021) attempts to compete with industry on what industry does best. We think this is a mistake. Industry has “unfair” advantagesFootnote 40 on tasks like pretraining $f_{pre}$ , which require large investments in people and machines, as shown in Table 8.

We recommend that academics focus on fit and predict, which are much more affordable than pretraining $f_{pre}$ . The last two columns in Table 8, time and hardware, obviously depend on many factors such as the size of the model. One of the motivations behind distillation (Hinton et al., 2015; Ganesh et al., 2021) is to reduce the size of the model. Smaller models tend to run faster at inference time. Although inference is much faster than training, inference time is often a bottleneck for commercial applications, since training is a one-time investment, whereas inference is a recurring cost. For successful applications with millions or billions of users, recurring costs can easily dominate one-time training costs.

As for training costs, pretraining is much more expensive than fine-tuning, especially for large models. Pretraining is already very expensive and will become even more expensive in the future as models become larger and larger. Pretraining large models will be beyond the means of academics (and governments).

Consider the pretrained models in Table 9, and especially the largest model, PaLM (Chowdhery et al., 2022). PaLM produces impressive results, using a huge model (540B parameters). That said, the size of the investment is even more impressive: the paper has dozens of authors using thousands of TPUs (distributed over multiple data centers).

When the investments are this large, projects become risk averse. Projects of this size cannot afford to fail. Academics should focus on projects that reward creativity and avoid projects that are too big to fail.

We like to think of $f_{pre}$ like Intel CPU chips. Universities can afford to program CPUs, but universities cannot afford to compete with Intel and fabricate their own CPUs. So too, we argue that universities can afford to fit and predict deep nets, but they cannot afford to compete with industry on $f_{pre}$ . When the first author was a student at MIT, his thesis advisor, Jon Allen, urged the university to make large investments in VLSI fabrication. In retrospect, it was probably a mistake for a university to invest in VLSI fabrication, though others may disagree with that assessment.Footnote 41

In short, we recommend users start by downloading $f_{pre}$ from hubs and focus on steps 2 (fit) and 3 (predict) of the standard recipe. Some examples of $f_{pre}$ are shown in Table 9. Many of these models can be downloaded from hubs, with a few exceptions, especially for larger models such as ERNIE 3.0, GPT-3, and PaLM. Most models are trained on corpora, as shown in Table 10.

Table 8. Most $f_{pre}$ models are trained in industry because pretraining requires large capital investments in large teams and GPU clusters

Table 9. gft starts with large pretrained base models, $f_{pre}$ , typically trained on the large corpora in Table 10, using expensive GPU clusters

Table 10. Some popular corpora for training pretrained models, $f_{pre}$

Footnotes

2 The term, little languages, is borrowed from Unix (Bentley, 1986). Programs in little languages such as AWK (Aho et al., 1987) are short (often just a single line of code) and powerful.

11 Currently supported tasks are: classify (aka text-classification), classify_tokens (aka token-classification), classify_spans (aka QA, question-answering), classify_images (aka image-classification), classify_audio (aka audio-classification), regress, text-generation, MT (aka translation), ASR (aka ctc, automatic-speech-recognition), fill-mask. Tasks in parentheses are aliases.

12 In addition to history, there are two more reasons to prefer the terms, fit and predict. First, the proposed terminology, as mentioned above, demystifies deep nets. No one would suggest that regression-like methods are “intelligent.” Second, the proposed terminology is intended to discourage work on foundation models, $f_{pre}$ . As will be discussed in Appendix A.1, the term, foundation models, was introduced to encourage work on $f_{pre}$ (Bommasani et al., 2021), but we believe it is a mistake for academics to compete with industry on tasks that require large investments, and more logistics and systems work than creative contributions to computational linguistics research.

40 “Unfair advantages” is management jargon, common in industry, especially when discussing strategy. Obviously, there is nothing “unfair” about taking advantage of one’s strengths.

References

Aho, A.V., Kernighan, B.W. and Weinberger, P.J. (1987). The AWK Programming Language. Addison-Wesley Longman Publishing Co., Inc.
Baevski, A., Zhou, H., Mohamed, A. and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.
Bentley, J.L. (1986). Little languages. Communications of the ACM 29(8), 711–721.
Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J.Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D.E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Kohd, P.W., Krass, M., Krishna, R., Kuditipudi, R., Kumar, A., Ladhak, F., Lee, M., Lee, T., Leskovec, J., Levent, I., Li, X.L., Li, X., Ma, T., Malik, A., Manning, C.D., Mirchandani, S., Mitchell, E., Munyikwa, Z., Nair, S., Narayan, A., Narayanan, D., Newman, B., Nie, A., Niebles, J.C., Nilforoshan, H., Nyarko, J., Ogut, G., Orr, L., Papadimitriou, I., Park, J.S., Piech, C., Portelance, E., Potts, C., Raghunathan, A., Reich, R., Ren, H., Rong, F., Roohani, Y., Ruiz, C., Ryan, J., Ré, C., Sadigh, D., Sagawa, S., Santhanam, K., Shih, A., Srinivasan, K., Tamkin, A., Taori, R., Thomas, A.W., Tramèr, F., Wang, R.E., Wang, W., Wu, B., Wu, J., Wu, Y., Xie, S.M., Yasunaga, M., You, J., Zaharia, M., Zhang, M., Zhang, T., Zhang, X., Zhang, Y., Zheng, L., Zhou, K. and Liang, P. (2021). On the opportunities and risks of foundation models.
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. and Amodei, D. (2020). Language models are few-shot learners. NeurIPS.
Buck, C., Heafield, K. and van Ooyen, B. (2014). N-gram counts and language models from the common crawl. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland: European Languages Resources Association (ELRA), pp. 3579–3584.
Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P. and Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al. (2022). PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Church, K., Chen, Z. and Ma, Y. (2021a). Emerging trends: A gentle introduction to fine-tuning. Natural Language Engineering 27(6), 763–778.
Church, K.W. and Kordoni, V. (2022). Emerging trends: Sota-chasing. Natural Language Engineering 28(2), 249–269.
Church, K.W., Yuan, X., Guo, S., Wu, Z., Yang, Y. and Chen, Z. (2021b). Emerging trends: Deep nets for poets. Natural Language Engineering 27(5), 631–645.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L. and Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.
Dale, R. (2021). GPT-3: What’s it good for? Natural Language Engineering 27(1), 113–118.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 248–255.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186.
Du, J., Na, X., Liu, X. and Bu, H. (2018). Aishell-2: Transforming mandarin asr research into industrial scale. arXiv preprint arXiv:1808.10583.
Ganesh, P., Chen, Y., Lou, X., Khan, M.A., Yang, Y., Sajjad, H., Nakov, P., Chen, D. and Winslett, M. (2021). Compressing large-scale transformer-based models: A case study on BERT. Transactions of the Association for Computational Linguistics 9, 1061–1080.
Guisan, A., Edwards, T.C. Jr and Hastie, T. (2002). Generalized linear and generalized additive models in studies of species distributions: Setting the scene. Ecological Modelling 157(2–3), 89–100.
He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hinton, G., Vinyals, O. and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia: Association for Computational Linguistics, pp. 328–339.
Ito, K. and Johnson, L. (2017). The LJ speech dataset. https://keithito.com/LJ-Speech-Dataset/
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach.
Mohammad, S. (2018). Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia: Association for Computational Linguistics, pp. 174–184.
Moore, G.A. and McKenna, R. (1999). Crossing the Chasm. Capstone Oxford.
Osgood, C.E., Suci, G.J. and Tannenbaum, P.H. (1957). The Measurement of Meaning, vol. 47. University of Illinois Press.
Panayotov, V., Chen, G., Povey, D. and Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5206–5210.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
Rajpurkar, P., Jia, R. and Liang, P. (2018). Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia: Association for Computational Linguistics, pp. 784–789.
Rajpurkar, P., Zhang, J., Lopyrev, K. and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas: Association for Computational Linguistics, pp. 2383–2392.
Su, W., Chen, X., Feng, S., Liu, J., Liu, W., Sun, Y., Tian, H., Wu, H. and Wang, H. (2021). Ernie-tiny: A progressive distillation framework for pretrained transformer compression. arXiv preprint arXiv:2106.02241.
Sun, Y., Wang, S., Feng, S., Ding, S., Pang, C., Shang, J., Liu, J., Chen, X., Zhao, Y., Lu, Y., et al. (2021). Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137.
Sun, Y., Wang, S., Li, Y., Feng, S., Tian, H., Wu, H. and Wang, H. (2020). Ernie 2.0: A continual pre-training framework for language understanding. AAAI.
Taylor, W.L. (1953). “Cloze procedure”: A new tool for measuring readability. Journalism Quarterly 30(4), 415–433.
Voorhees, E.M. (2001). The TREC question answering track. Natural Language Engineering 7(4), 361–378.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems 32.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium: Association for Computational Linguistics, pp. 353–355.
Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Yan, Z., Tomizuka, M., Gonzalez, J., Keutzer, K. and Vajda, P. (2020). Visual transformers: Token-based image representation and processing for computer vision.
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A. and Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27.

Table 1. The standard recipe consists of three steps

Listing 1. Example of fit and predict in R.

Figure 1. Results produced by Listing 1. Predictions from the model g are shown in red.

Listing 2. Example of gft_fit.

Listing 3. Example of gft_predict. The default model performs sentiment analysis.

Listing 4. Example of gft_predict with a model for emotion classification.

Listing 5. Example of gft_summary.

Table 2. Sentiment classification of $x=$ I love you

Table 3. More classifications of $x=$ I love you

Listing 6. Example of gft_summary as a search engine.

Listing 7. Example of gft_summary with the null string as a query.

Listing 8. Examples of gft_summary.

Listing 9. An example of gft_fit using P for PaddleNLP/PaddleHub.

Table 4. Fine-tuning for downstream tasks: GLUE, SQuAD, etc.

Table 5. Some examples of tasks

Listing 10. Example of token classification.

Listing 11. Example of token classification with PaddleNLP.

Listing 12. Example of fill-mask (also known as cloze task).

Listing 13. Example of text generation.

Listing 14. Example of machine translation (MT).

Listing 15. Example of automatic speech recognition (ASR).

Listing 16. Example of image classification.

Listing 17. Example of input from data set (as opposed to stdin).

Listing 18. gft_eval outputs a single score for a data set, as opposed to gft_predict, which outputs a prediction for each row.

Listing 19. Code to create confusion matrix.

Table 6. Confusion matrix from Listing 19

Table 7. Some gold labels and predictions from model, $f_{post}$, from Listing 20

Listing 20. An equation with a vector on the left-hand side (lhs).

Table 8. Most $f_{pre}$ models are trained in industry because pretraining requires large capital investments in large teams and GPU clusters

Table 9. gft starts with large pre-trained base models, $f_{pre}$, typically trained on large corpora in Table 10, using expensive GPU clusters

Table 10. Some popular corpora for training pre-trained models, $f_{pre}$