ChatGPT vs small models: can Goliath help David?
The daily work of an NLP specialist is full of challenges. One of the main challenges that everyone in the field may face is a lack of data: our customers sometimes have too little data, low-quality data, or no data at all to reach good model performance.
The second challenge is model size. We sometimes need a small language model that does not consume many resources yet delivers performance equal or close to that of a larger language model.
With these two challenges in mind, we decided to explore how we could leverage a Large Language Model such as ChatGPT to help train a smaller model to a specific task.
Our team decided to compare the performance of two large pre-trained models against several smaller models trained on a fully (100%) or partially (90%) synthetic dataset in French, generated by ChatGPT.
We also wanted to assess the added value of CoT (Chain of Thought), where ChatGPT generates not only the data but also a verbalized rationale for using or interpreting it: in our case, an explanation of the chosen label. CoT has proven valuable for Large Language Models, and we wanted to assess whether it can also improve small-model performance.
We chose French because there are fewer quality datasets and pretrained models in French than in English, so the need for such an optimized approach arises more often.
One of the popular requests in NLP is the creation of a conversational agent to respond to clients’ needs more quickly. To accomplish this type of project, it is necessary to detect clients’ intentions in order to provide them with the required information.
To handle this task, we chose a dataset of sentences and their classification labels, provided by Husain Khatba and published on Kaggle. We selected the portion of the data related to tourism, a fairly popular topic in customer support.
We selected the following labels:
réservation de restaurant (restaurant booking)
bagages perdus (lost luggage)
annuler une réservation (booking cancellation)
visa international (international visa)
suggestion de voyage (travel recommendations)
transfert bancaire (bank transfer)
réserver un vol (flight booking)
location de voiture (car rental)
réserver un hôtel (hotel booking)
état du vol (flight status information)
Data Generation Step
Subsequently, we created two datasets of customer questions labelled with the intents above. The first is 100% synthetic; the second is composed of 90% synthetic data derived from 10% reference data taken from the Kaggle train set. This let us observe whether data augmentation improves model performance compared with purely synthetic generation. The augmentation was done by asking ChatGPT to generate variants of the requests contained in the 10% reference set.
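The augmentation step described above can be sketched as follows. The prompt wording and helper names here are our own illustrations, not the exact ones used in our notebook; the generated prompt would be sent to the ChatGPT API and its completion parsed back into sentences.

```python
# Sketch of the augmentation step: ask ChatGPT for variants of real requests
# for a given intent label, then parse the numbered-list reply.
# Prompt template and function names are illustrative assumptions.

def build_variant_prompt(label: str, seed_examples: list[str], n: int = 10) -> str:
    """Assemble a French prompt asking for n paraphrases of the seed requests."""
    seeds = "\n".join(f"- {s}" for s in seed_examples)
    return (
        f"Voici des demandes de clients pour l'intention « {label} » :\n"
        f"{seeds}\n"
        f"Génère {n} nouvelles variantes, une par ligne, numérotées."
    )

def parse_numbered_list(reply: str) -> list[str]:
    """Turn a numbered-list completion ('1. ...') back into clean sentences."""
    sentences = []
    for line in reply.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and "." in line:
            sentences.append(line.split(".", 1)[1].strip())
    return sentences
```

In practice, the prompt returned by `build_variant_prompt` is sent to the chat completion endpoint, and `parse_numbered_list` is applied to the model's reply to collect the new synthetic examples.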
Furthermore, with the aim of improving the performance of the small models, we decided to generate not only the examples and their labels but also the explanations justifying the choice of each label (Chain Of Thought).
The notebook that allowed us to generate these two datasets is available in 
We trained three models of different sizes and complexity and compared their performance.
The smaller models we chose were Flaubert (Unsupervised Language Model Pre-training for French) and an SVM (Support Vector Machine). We fine-tuned Flaubert on our data (with and without explanations) and trained the SVM from scratch (with and without explanations) in order to compare their results with those of the large models.
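A minimal sketch of the SVM baseline is a TF-IDF + linear SVM pipeline trained from scratch on the generated sentences. The toy training data below is illustrative only; the real training used the synthetic French datasets described earlier.

```python
# TF-IDF features + linear SVM, a common from-scratch baseline for intent
# classification. Toy data stands in for the synthetic dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "je voudrais réserver un vol pour Paris",
    "réserver un billet d'avion pour demain",
    "ma valise a disparu à l'aéroport",
    "mes bagages sont perdus depuis hier",
]
train_labels = ["réserver un vol", "réserver un vol", "bagages perdus", "bagages perdus"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # bag-of-words TF-IDF features
    ("svm", LinearSVC()),          # linear support vector classifier
])
clf.fit(train_texts, train_labels)
pred = clf.predict(["j'ai perdu mes bagages"])[0]
```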
Our GitHub repository includes the detailed description of the model training process.
To compare the models' performance, our team chose four metrics: accuracy, precision, recall, and F1 score. The two large models (ChatGPT and a zero-shot BERT-type model) were evaluated on the test data, while the other models were trained on our generated data (with and without explanations) and then evaluated on the same test data. The table below shows the performance obtained with each model:
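The four metrics in the table can be computed with scikit-learn as below. The `y_true` / `y_pred` vectors here are toy placeholders, not our actual predictions; we use macro averaging so every intent weighs equally regardless of class size.

```python
# Computing accuracy, precision, recall, and F1 for a multi-class intent task.
# The label vectors are illustrative placeholders.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["vol", "vol", "bagages", "visa"]
y_pred = ["vol", "bagages", "bagages", "visa"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging weights every intent equally, regardless of class size.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```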
Table 1 - metrics comparison
For this classification task, the leader remains ChatGPT with a score of 96%. However, it is interesting to note that our smaller models, trained on the 90% synthetic dataset, come very close: 93% for the SVM and 94% for Flaubert (without the CoT approach), while the zero-shot model showed the worst results in our selection.
The performance of the models trained with explanations is slightly lower compared to the models without explanations, but the difference is not significant.
For the 100% synthetic dataset, we observe a different trend: explanations help the models improve performance, and they even manage to achieve results comparable to the zero-shot model. However, this performance is still far from the one obtained with semi-synthetic data.
This experiment shows that synthetic data generation from a small subset of real examples significantly improves the performance of the trained models and brings them close to that of very large language models. Ideally, synthetic data should be derived from a subset of real-case data, which is often available in real-life NLP applications.
A major advantage of the trained models described above is that they take up much less space (see Table 2): inference can be run locally at low computational cost and higher speed (see Table 3).
Table 2 - Model size
Table 3 - Model inference time (in sec)
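Inference latency figures like those in Table 3 can be measured with a simple wall-clock harness such as the one below. The `timed` helper is our own illustrative utility; `fn` would be the `predict` method of any fitted local model, and here a trivial function stands in so the sketch is self-contained.

```python
# Rough local-latency measurement: mean wall-clock time of a call over
# several runs. In practice, fn would be a fitted model's predict method.
import time

def timed(fn, *args, repeats: int = 100) -> float:
    """Return the mean wall-clock duration (seconds) of fn(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

mean_sec = timed(len, "état du vol")  # stand-in workload
```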
On the other hand, the Chain of Thought approach we tested proved beneficial when only synthetic training data is used. In the absence of real training data, the CoT approach can help improve model performance.
* Follow us on LinkedIn for next blog updates:
* Interested in our skills? Let's discuss your projects together:
* Our public GitHub repository: