ClimateBert

What is ClimateBert?

ClimateBert is the name of both our large language model (LLM) and our ensemble of downstream task models, all developed in a series of research papers by the author team of Julia Anna Bingler, Mathias Kraus, Markus Leippold, and Nicolas Webersinke.

The ClimateBERT Large Language Model

Using the DistilRoBERTa model as a starting point, the ClimateBERT language model is additionally pretrained on a text corpus comprising climate-related research paper abstracts, corporate and general news, and company reports.

The pretrained, climate-domain-adapted language models with a masked language modeling head are publicly available on the 🤗 Hugging Face Hub:

Note: We generally recommend choosing the ClimateBERT-f language model over the variants based on the other sample selection strategies (unless you have good reasons not to). It is also the only language model we will update from time to time.
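
As a quick sanity check, the language model can be loaded through the 🤗 Transformers fill-mask pipeline. The sketch below assumes the Hub checkpoint name climatebert/distilroberta-base-climate-f; swap in the variant you actually want to use.

```python
# Minimal sketch: load the ClimateBERT language model with its masked-LM head.
# The checkpoint name "climatebert/distilroberta-base-climate-f" is assumed here;
# replace it with another variant if you prefer a different sample selection strategy.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="climatebert/distilroberta-base-climate-f")

# DistilRoBERTa uses "<mask>" as its mask token.
predictions = fill_mask("Rising sea levels are a major <mask> for coastal cities.")
for p in predictions:
    print(f"{p['token_str']:>15}  {p['score']:.3f}")
```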

The underlying methodology can be found in our language model research paper: CLIMATEBERT: A Pretrained Language Model for Climate-Related Text

If you use our LLM, please cite:

@inproceedings{wkbl2022climatebert,
  title     = {{ClimateBERT: A Pretrained Language Model for Climate-Related Text}},
  author    = {Webersinke, Nicolas and Kraus, Mathias and Bingler, Julia and Leippold, Markus},
  booktitle = {Proceedings of AAAI 2022 Fall Symposium: The Role of AI in Responding to Climate Challenges},
  year      = {2022},
  doi       = {10.48550/arXiv.2212.13631},
}

The ClimateBERT downstream task models

So far, ClimateBERT has been fine-tuned on five downstream tasks, for which the models with a classification head are publicly available on the 🤗 Hugging Face Hub. It is able to:

The additional downstream tasks that ClimateBERT is fine-tuned on can serve various use cases. For example, it could aid financial supervisors in assessing the state of corporate climate risk disclosures, or support supervisory agencies and other stakeholders in their recent efforts to detect corporate greenwashing. Financial analysts might use ClimateBERT to identify a company's climate risks and opportunities and to assess the specificity of its climate-related claims. The outputs from the various classification tasks can also be combined into further indicators, such as a cheap talk index obtained by combining the commitment and specificity results.
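
As an illustration, a fine-tuned downstream model can be applied to disclosure paragraphs through the text-classification pipeline. The checkpoint name used below (for the climate-detection task) is an assumption; substitute the fine-tuned model that matches your use case.

```python
# Minimal sketch: classify paragraphs with one of the fine-tuned ClimateBERT models.
# The checkpoint name below (climate-detection task) is an assumption; substitute
# the fine-tuned model for the task you need (e.g. sentiment, commitment, specificity).
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="climatebert/distilroberta-base-climate-detector",
)

paragraphs = [
    "We aim to reduce our Scope 1 and Scope 2 emissions by 50% by 2030.",
    "The board of directors met four times during the fiscal year.",
]
for paragraph, result in zip(paragraphs, classifier(paragraphs)):
    print(f"{result['label']:>12}  {result['score']:.3f}  {paragraph}")
```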

The underlying methodology can be found in our research paper: How Cheap Talk in Climate Disclosures Relates to Climate Initiatives, Corporate Emissions, and Reputation Risk

If you use our fine-tuned models, please cite:

@techreport{bingler2023cheaptalk,
  title       = {How Cheap Talk in Climate Disclosures Relates to Climate Initiatives, Corporate Emissions, and Reputation Risk},
  author      = {Bingler, Julia and Kraus, Mathias and Leippold, Markus and Webersinke, Nicolas},
  type        = {Working paper},
  institution = {Available at SSRN 4000708},
  year        = {2023},
}

The corresponding ClimateBERT datasets

Except for the text corpus used to additionally pretrain our LLM, all of our corresponding datasets are also available on the 🤗 Hugging Face Hub:

Note: The text corpus used to additionally pretrain our LLM contains proprietary data that we are unfortunately not allowed to share.
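
The published datasets can be pulled with the 🤗 Datasets library. The dataset identifier used below is an assumption for the climate-detection data; replace it with the identifier of whichever dataset you need from our Hub organization.

```python
# Minimal sketch: load one of the ClimateBERT datasets from the Hub.
# The identifier "climatebert/climate_detection" is an assumption; use the
# identifier of the dataset you actually need.
from datasets import load_dataset

dataset = load_dataset("climatebert/climate_detection")

print(dataset)              # available splits and features
print(dataset["train"][0])  # first training example
```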

The data annotation process and more information about the datasets can be found in our research papers:

If you use our datasets, please cite:

@techreport{bingler2023cheaptalk, title={How Cheap Talk in Climate Disclosures Relates to Climate Initiatives, Corporate Emissions, and Reputation Risk}, author={Bingler, Julia and Kraus, Mathias and Leippold, Markus and Webersinke, Nicolas}, type={Working paper}, institution={Available at SSRN 4000708}, year={2023} }

The ClimateBERT climate performance model card

The table below shows our climate performance model card, following Hershcovich et al. (2022).


| Metric | Value |
| --- | --- |
| Model publicly available | Yes |
| Time to train final model | 48 hours |
| Time for all experiments | 350 hours |
| Power of GPU and CPU | 0.7 kW |
| Location for computations | Germany |
| Energy mix at location | 470 gCO2eq/kWh |
| CO2eq for final model | 15.79 kg |
| CO2eq for all experiments | 115.15 kg |
| Average CO2eq for inference per sample | 0.62 mg |
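
For transparency, the training CO2eq figures follow directly from training time, hardware power draw, and the energy mix at the compute location. A small sanity-check calculation using the numbers from the table above is sketched below.

```python
# Sanity check for the CO2eq figures in the table above:
# CO2eq = training time (h) x power (kW) x energy mix (gCO2eq/kWh)
power_kw = 0.7              # power of GPU and CPU
energy_mix_g_per_kwh = 470  # energy mix at the compute location

final_model_kg = 48 * power_kw * energy_mix_g_per_kwh / 1000
all_experiments_kg = 350 * power_kw * energy_mix_g_per_kwh / 1000

print(f"Final model:     {final_model_kg:.2f} kg CO2eq")    # 15.79 kg
print(f"All experiments: {all_experiments_kg:.2f} kg CO2eq") # 115.15 kg
```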