Joint contrastive learning and belief rule base for named entity recognition in cybersecurity

Named Entity Recognition (NER) in cybersecurity is crucial for mining information during cybersecurity incidents. Current methods rely on pre-trained models for rich semantic text embeddings, but the challenge of anisotropy may affect subsequent encoding quality. Additionally, existing models may struggle with noise detection. To address these issues, we propose JCLB, a novel model that Joins Contrastive Learning and Belief rule base for NER in cybersecurity. JCLB utilizes contrastive learning to enhance similarity in the vector space between token sequence representations of entities in the same category. A Belief Rule Base (BRB) is developed using regexes to ensure accurate entity identification, particularly for fixed-format phrases lacking semantics. Moreover, a Distributed Constraint Covariance Matrix Adaptation Evolution Strategy (D-CMA-ES) algorithm is introduced for BRB parameter optimization. Experimental results demonstrate that JCLB, with the D-CMA-ES algorithm, significantly improves NER accuracy in cybersecurity.


Introduction
As cybercrimes and cyber-espionage incidents continue to escalate, cybersecurity has gained increasing significance for individuals, businesses, and governments (Ashraf et al. 2023). In the event of a cybersecurity incident, analysts need to swiftly identify entities from diverse incident logs, sourced from host log data, cyber traffic data, security alarm data, and threat intelligence data. These entities impact the cybersecurity situation, yet they are not directly observable in the actual cyber environment. Instead, they manifest within various cybersecurity events. To respond efficiently and effectively to cybersecurity incidents, it is essential to model and recognize entities across a vast array of cybersecurity data. With the development of Named Entity Recognition (NER), neural networks have been applied to entity extraction in the cybersecurity field (Gao et al. 2021). Whether utilizing pre-trained models in the representation process or employing encoders in the encoding process, these approaches allow for a comprehensive consideration of the contextual influence on each word.
However, there are still challenges within NER for cybersecurity data. First, embeddings derived from pre-trained language models such as BERT often exhibit excessive clustering and uneven distribution in vector space (Gao et al. 2021). This phenomenon can lead to semantically similar tokens or token sequences being positioned further apart, while semantically unrelated tokens or sequences may end up with closely aligned vectors. The suboptimal representation of semantic similarity can skew the model's ability to accurately identify entities, potentially impacting its overall performance by favoring certain directional biases. Second, present methods do not verify the correctness of the entities they recognize, even though cybersecurity data contains a significant amount of noise. For instance, an IP address that appears in the text may be incorrect either in format or in content. Despite this, current models continue to label such instances as IP addresses, demonstrating a lack of judgment on their validity.
In this paper, we propose JCLB, which Joins Contrastive Learning and Belief rule base, designed for NER in cybersecurity. Inspired by the successful application of contrastive learning in text clustering (Hu et al. 2024), we use contrastive learning to fine-tune BERT, with the purpose of closely aligning token sequence representations for the same type of entities in vector space, while keeping them distinct from those of other token sequences. Specifically, we devise objectives based on span and position to enhance the representation similarity of both the token sequence and the tokens at the boundary for entities of the same type in the vector space. Additionally, to effectively filter noise and discern entity correctness, we establish regexes as rules. We learn the confidence of each rule to create a Belief Rule Base (BRB), which filters entity categories and simultaneously assesses their correctness. The BRB mitigates potential errors associated with relying solely on regexes. Furthermore, while the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm is a robust optimization algorithm for BRB, it may not perform optimally for larger-scale or high-dimensional optimization problems (Hansen 2006; Yao et al. 2004). To address these challenges, we propose the Distributed CMA-ES (D-CMA-ES) algorithm, which divides the high-dimensional search space into several subspaces of relatively lower dimension and uses the CMA-ES algorithm to search in these subspaces. Finally, the solutions in the low-dimensional subspaces are combined to obtain the solution to the original problem.
Our contributions are as follows.
• We apply contrastive learning to fine-tune BERT, enhancing the similarity of the same type of entities in the vector space.
• We establish a BRB, which combines qualitative information with the capacity to describe various types of uncertain information, to filter noise and verify entity accuracy. Additionally, we develop the D-CMA-ES algorithm to address the high dimensionality of BRB parameter optimization.
• We conduct extensive experiments on two cybersecurity datasets, and the experimental results demonstrate the superiority of JCLB over existing models.
In the rest of the paper, we cover the related work in section "Related work" and then present JCLB in section "Methodology". After reporting the experimental study in section "Experiments", we conclude our work in section "Conclusion".

Traditional methods for NER in cybersecurity
Early NER approaches primarily fall into two categories: rule-based and statistical machine-learning models.
Rule-based methods rely on expert-crafted rules, incorporating gazetteers and syntactic lexical patterns (Etzioni et al. 2005; Bridges et al. 2017). Statistical approaches leverage machine learning algorithms such as Hidden Markov Models (Morwal et al. 2012), Support Vector Machines (Mansouri et al. 2008), Perceptrons (Jin et al. 2020), and Conditional Random Fields (CRFs) (Joshi et al. 2013; Jia et al. 2018). Mulwad et al. (2011) extracted specific vulnerabilities and attack knowledge from Wikipedia, generating machine-understandable assertions, but did not consider temporal factors. Lal (2013) trained a model using Stanford NER's Conditional Random Fields, automating and enhancing zero-day attack security, yet its performance is limited on cybersecurity data. Weerawardhana et al. (2015) proposed a machine learning and part-of-speech tagging strategy to extract intelligence from online vulnerability databases.

Neural networks for NER in cybersecurity
In recent years, deep neural networks have been considered potential alternatives to traditional NER methods due to the rapid development of deep learning (Altalhi and Gutub 2021; Kashihara et al. 2022; Zhu et al. 2021; Zhang et al. 2022). Collobert et al. (2011) proposed a neural network architecture and learning algorithm that reduces reliance on prior NLP knowledge, albeit with only moderate improvements in feature representation. Huang et al. (2015) integrated BiLSTM and CRF, effectively performing sequence labeling tasks and establishing the dominance of RNN-based sequence models in NER tasks. Kim et al. (2020) designed an open-source Python library named CyNER, using transformer-based models and heuristic methods to extract cybersecurity-related entities and Indicators of Compromise (IOC). This framework offers good portability and scalability while providing multiple trained models.
In this paper, we introduce contrastive learning into NER in cybersecurity to bring the span representations for similar entities closer in the embedding space. Additionally, we utilize BRB to mitigate the impact of noisy entities.

Methodology
We begin by offering an overview of JCLB, illustrated in Fig. 1. Sentences are initially transformed into embedding matrices via BERT (Section "Initialize embedding"). In this process, we employ contrastive learning to fine-tune BERT (Section "Contrastive learning for NER"). Specifically, we obtain span representations for entities in each sentence. We then generate prototypes of the span, initial-token, and final-token representations for entities of the same type in a mini-batch. Based on these prototypes, we introduce three contrastive learning objectives for NER. Then, we use a BiLSTM to concatenate the forward and backward hidden vectors, which better captures long-distance bidirectional semantic dependencies (Section "BiLSTM layer"). To enable JCLB to focus selectively on the more crucial parts of the input sequence in the presence of noise, we also introduce a Multi-head Self-attention (MS) layer (Section "Multi-head self-attention layer"). A CRF model is then used to predict the likelihood of each token belonging to each label (Section "CRF layer"). Finally, a BRB filters out inaccurately recognized entities, improving the precision of recognizing cybersecurity entities, particularly fixed-format phrases lacking semantics (Section "BRB layer").
Fig. 1 The overview of the JCLB

Initialize embedding
JCLB first uses BERT to transform each token in the sentence into a vector $v \in \mathbb{R}^{d}$ that consists of two parts: a word embedding $v_w \in \mathbb{R}^{d_w}$ and a position embedding $v_p \in \mathbb{R}^{d_p}$, where $d$, $d_w$, and $d_p$ are the dimensions of $v$, $v_w$, and $v_p$, respectively. Hence, $v$ is denoted as $v = [v_w, v_p]$, the concatenation of the word embedding and the position embedding. A sentence with $n$ tokens is thus transformed into a sentence matrix $T \in \mathbb{R}^{d \times n}$.

Contrastive learning for NER
After obtaining the initial token vectors, we introduce a contrastive learning objective to fine-tune BERT. Contrastive learning is primarily applied in representation learning to alleviate the various idiosyncrasies of BERT. Its main purpose is to bring the embeddings of similar texts closer in the vector space while pushing apart those of dissimilar texts. As seen in Fig. 2, in NER we aim for BERT's representations of token sequences belonging to the same entity type to be closer in the vector space, while being farther away from token sequences of other types. Based on this, we derive the vector representation for a contiguous sequence of tokens in a sentence, with a start token at position $i$ and an end token at position $j$, as
$$\mathrm{span}_{i,j} = \mathrm{Linear}\big(v_i \oplus v_j \oplus l(j-i)\big),$$
where $v_i$ and $v_j$ are the representations of the start and end tokens, Linear is a learnable linear layer, $\oplus$ denotes vector concatenation, and $l(j-i) \in \mathbb{R}^{d_l}$ is the $(j-i)$-th row of a learnable span width embedding matrix $l \in \mathbb{R}^{n \times d_l}$. Given the predefined entity types, within a mini-batch we obtain vector representations for all sequences representing the $k$-th entity type $e_k$. The set of these vector representations is denoted as $\{\mathrm{span}_i\}_{i=1}^{K}$, and its prototype is computed from this set. Accordingly, the span-based InfoNCE objective (Oord et al. 2019) is defined with this prototype as the anchor, where $\mathrm{span}_{i,j}$ denotes the sequence vector representation for an entity of type $e_k$ and $S_k^{-}$ is the set of negative sequences, all of which come from the same mini-batch.
The span-based objective uniformly penalizes all non-entity token sequences. Therefore, to identify the boundaries of token sequences representing a specific entity, we propose a position-based objective. Intuitively, we want the initial (or final) tokens of entities of the same type to be closer in the embedding space. Specifically, we compute the prototypes $p_k^{\mathrm{start}}$ and $p_k^{\mathrm{end}}$ of the initial and final tokens in the token sequences of same-type entities in a mini-batch (the final token of $\mathrm{span}_i$ being its $n$-th token, where $n$ is the number of tokens in $\mathrm{span}_i$). Using $p_k^{\mathrm{start}}$ and $p_k^{\mathrm{end}}$ as anchors, we define the position-based objectives. Finally, we obtain the overall contrastive objective by integrating the three objectives discussed above, weighted by the hyper-parameters $\alpha$, $\beta$, and $\gamma$.
Fig. 2 The contrastive learning objectives. In the mini-batch, entities such as Stuxnet, WannaCry, Mirai Botnet, and NotPetya are predefined "malware" entities. In the vector space, for "malware" entities, we define the prototype of the corresponding entity representations as anchors, denoted by cross marks. Positive samples are representations of "malware" entities, indicated by green triangles, while negative samples are representations of other token sequences, represented by blue circles
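The prototype-anchored objective described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the prototype is assumed to be the mean of the same-type span vectors, cosine similarity is assumed as the scoring function, and a temperature `tau` is introduced; all function and variable names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def prototype(spans):
    """Assumed prototype: the mean of the span vectors of one entity type."""
    return spans.mean(axis=0)

def prototype_info_nce(positive, negatives, anchor, tau=0.1):
    """InfoNCE loss of one positive span against in-batch negatives,
    with the entity-type prototype as the anchor (assumed form)."""
    a = l2_normalize(anchor)
    pos_sim = l2_normalize(positive) @ a / tau      # scalar
    neg_sim = l2_normalize(negatives) @ a / tau     # shape (num_negatives,)
    logits = np.concatenate(([pos_sim], neg_sim))
    m = logits.max()                                # for numerical stability
    return -(pos_sim - m - np.log(np.exp(logits - m).sum()))

# Toy mini-batch: four "malware" spans and six spans of other token sequences.
rng = np.random.default_rng(0)
malware = rng.normal(loc=3.0, size=(4, 8))
others = rng.normal(loc=-3.0, size=(6, 8))
anchor = prototype(malware)

loss_good = prototype_info_nce(malware[0], others, anchor)   # true positive
loss_bad = prototype_info_nce(others[0], malware, anchor)    # wrong "positive"
```

The same function could be reused for the position-based objectives by passing initial-token or final-token vectors together with the corresponding prototype, and the three losses could then be combined with the weights α, β, and γ.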

BiLSTM layer
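In this section, we use a BiLSTM to encode the sentence matrix. At the $t$-th time step, we first calculate a forget gate to determine what information to discard as $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. Secondly, we calculate the memory gate to select the information to be memorized as $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, together with the temporary cell state $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$. Thirdly, we calculate the current cell state, which integrates the memory and forget gates with the temporary cell state and the previous cell state, as $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. Finally, we calculate the output gate as $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ and the hidden layer state as $h_t = o_t \odot \tanh(C_t)$. Here $x_t$ is the input vector at step $t$, $\sigma$ is the sigmoid function, and $W$ and $b$ denote learnable weights and biases. Running the LSTM in both directions and concatenating the forward and backward states yields a hidden layer state sequence with the same length as the sentence.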

Multi-head self-attention layer
After the BiLSTM has encoded the embeddings, we use the MS layer to further capture the dependency between tokens in the sequence $X = \{v_1, \ldots, v_n\}$ (Manikandan et al. 2018; Jin et al. 2020; Liao et al. 2019) and improve the robustness of JCLB. The attention mechanism is described as a mapping from query tokens $Q = XW^{Q}$ to a series of key tokens $K = XW^{K}$ and value tokens $V = XW^{V}$ in the sentence, where $W^{Q}$, $W^{K}$, and $W^{V}$ are parameter matrices. The weight corresponding to each value token is obtained by calculating the similarity between the query token and each key token. This similarity is calculated by the dot product, and the scaled dot-product attention score is
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $d_k$ is the dimension of the key vectors. To obtain the MS score, we perform the scaled dot-product attention calculation $h$ times: the input is mapped to $h$ different subspaces through the parameter matrices, the scaled dot-product attention score is calculated in each subspace, and the results are concatenated to form the final attention score. The $i$-th self-attention vector is calculated as
$$\mathrm{head}_i = \mathrm{Attention}(XW_i^{Q}, XW_i^{K}, XW_i^{V}),$$
and the MS score is calculated as
$$\mathrm{MS}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O},$$
where $W^{O}$ is a weight matrix.
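As a rough illustration of the MS computation, the following NumPy sketch applies scaled dot-product attention in several heads and concatenates the results. The dimensions, the random parameter matrices, and the function names are assumptions chosen for the example; they are not the trained parameters or the code of JCLB.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n), rows sum to 1
    return weights @ V, weights

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """Project X into h subspaces, attend in each, concatenate, project back."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        head, _ = scaled_dot_product_attention(X @ Wq_i, X @ Wk_i, X @ Wv_i)
        heads.append(head)
    return np.concatenate(heads, axis=-1) @ W_o

# Toy sizes (hypothetical): 5 tokens, model width 12, 3 heads of width 4.
rng = np.random.default_rng(0)
n, d_model, h, d_k = 5, 12, 3, 4
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
_, attn = scaled_dot_product_attention(X @ W_q[0], X @ W_k[0], X @ W_v[0])
```

Each row of `attn` is a distribution over the tokens of the sentence, which is what lets the layer down-weight noisy tokens.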

CRF layer
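In this layer, we regard the extraction of entities in cybersecurity as a sequence labeling task. We denote the input sequence as $x = \{x_1, x_2, \ldots, x_n\}$. After the processing of the MS layer, we get an $n \times m$ matrix $P$, where $n$ is the number of input tokens and $m$ is the number of label types. The entry $P_{i,j}$ is the probability that the $i$-th token in the sentence takes the $j$-th label. We represent $y = \{y_1, y_2, \ldots, y_n\}$ as a label sequence, so the model calculates the corresponding score
$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} D_{y_i, y_{i+1}},$$
where $D_{y_i, y_{i+1}}$ is the transition probability from label $y_i$ to label $y_{i+1}$. Then, we apply softmax to obtain the normalized probability
$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}.$$
After that, we maximize the log-likelihood of the correct label sequence $y^{x}$ during training:
$$\log P(y^{x} \mid x) = \mathrm{score}(x, y^{x}) - \log \sum_{y'} \exp(\mathrm{score}(x, y')).$$
Finally, in the prediction process, the Viterbi algorithm is used to find the highest-scoring label sequence:
$$y^{*} = \arg\max_{y'} \mathrm{score}(x, y').$$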
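The decoding step can be illustrated with a small Viterbi sketch. It assumes the score of a label sequence is the sum of per-token emission scores and label-to-label transition scores, without special start or end labels; the matrices below are random placeholders, and the names are hypothetical.

```python
from itertools import product

import numpy as np

def sequence_score(P, D, y):
    """Emission scores P[i, y_i] plus transition scores D[y_i, y_{i+1}]."""
    emit = sum(P[i, y[i]] for i in range(len(y)))
    trans = sum(D[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def viterbi(P, D):
    """Return the highest-scoring label sequence and its score."""
    n, m = P.shape
    best = P[0].copy()                      # best score ending in each label
    back = np.zeros((n, m), dtype=int)      # back-pointers
    for i in range(1, n):
        # cand[prev, cur] = best path ending in prev, then prev -> cur at token i
        cand = best[:, None] + D + P[i][None, :]
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0)
    path = [int(best.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1], float(best.max())

# Toy example: 4 tokens, 3 labels; compare against exhaustive search.
rng = np.random.default_rng(1)
P = rng.normal(size=(4, 3))
D = rng.normal(size=(3, 3))
path, score = viterbi(P, D)
brute = max(product(range(3), repeat=4), key=lambda y: sequence_score(P, D, y))
```

Dynamic programming keeps the cost linear in the sentence length, whereas the exhaustive search used here as a check grows exponentially.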

BRB layer
The difference between the BRB layer and the data-driven models mentioned above is that the BRB's internal structure can be explained (Yang et al. 2006). Additionally, compared to the aforementioned data-driven models, the BRB model has the ability to comprehensively utilize semi-quantitative information and describe all kinds of uncertain information (Yang et al. 2004).

Construction of BRB
We construct a BRB that comprises multiple rules. In these rules, we consider the CRF output as one of the premise attributes and incorporate powerful, easy-to-understand regexes as another necessary attribute. This helps to accurately identify cybersecurity entities that have a fixed format but no particular semantic relevance in cybersecurity incidents. Table 1 showcases some of the regexes we develop, where $R_{i,j}$ refers to the $j$-th regex of the $i$-th entity category. Each rule and each premise attribute of a rule has a certain weight, and the consequent of the rule is assigned a confidence that expresses the credibility of the conclusion. The BRB model can be described in the following form, where "$\wedge$" denotes that the rule is based on the intersection assumption:
$$R_k: \text{If } x_1 \text{ is } A_1^{k} \wedge x_2 \text{ is } A_2^{k} \wedge \cdots \wedge x_M \text{ is } A_M^{k}, \text{ then } \{(D_1, \beta_{1,k}), (D_2, \beta_{2,k}), \ldots, (D_N, \beta_{N,k})\},$$
with a rule weight $\theta_k$ and attribute weights $\delta_1, \delta_2, \ldots, \delta_M$. Here $x_i$ ($i = 1, 2, \ldots, M$) denotes the $i$-th premise attribute of the BRB model, and $M$ denotes the number of premise attributes. $R_k$ ($k = 1, 2, \ldots, L$) denotes the $k$-th rule of the BRB model, $A_i^{k}$ ($i = 1, 2, \ldots, M$) denotes the reference value of the $i$-th premise attribute in the $k$-th rule, $D_j$ ($j = 1, 2, \ldots, N$) denotes the $j$-th category, $\beta_{j,k}$ denotes the confidence of the $j$-th conclusion in the $k$-th rule, $\theta_k$ denotes the weight of the $k$-th rule, and $\delta_i$ denotes the weight of the $i$-th premise attribute. The structure of the BRB model is shown in Fig. 3. For example, the rule "if the category of the entity output by CRF is Identifier, and the entity can match $R_{1,1}$, then the confidence that the entity is Identifier is 100%" can be expressed as: If (Identifier is true) $\wedge$ ($R_{1,1}$ is true), then (Identifier, 100%).
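To make the rule structure concrete, the sketch below combines a CRF label with a regex match and returns a belief distribution over categories. The regexes, category names, and confidence values are hypothetical stand-ins chosen for illustration; they are not the entries of Table 1 or the learned confidences of JCLB.

```python
import re

# Hypothetical stand-ins for the regexes R_{i,j}; not the authors' Table 1.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
REGEXES = {
    "IP": [re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")],
    "Identifier": [re.compile(r"CVE-\d{4}-\d{4,}")],
}

# Rules keyed by (CRF label, regex matched?); the values are illustrative
# belief distributions {category: confidence}, not learned parameters.
RULES = {
    ("IP", True): {"IP": 1.0},
    ("IP", False): {"IP": 0.1, "O": 0.9},
    ("Identifier", True): {"Identifier": 1.0},
    ("Identifier", False): {"Identifier": 0.2, "O": 0.8},
}

def regex_matches(label, text):
    """True if any regex of the category matches the whole entity string."""
    return any(rx.fullmatch(text) for rx in REGEXES.get(label, []))

def apply_rule(crf_label, text):
    """Activate the rule selected by the two premise attributes."""
    key = (crf_label, regex_matches(crf_label, text))
    return RULES.get(key, {crf_label: 1.0})   # categories without regexes pass through

valid_ip = apply_rule("IP", "1.1.1.1")
noisy_ip = apply_rule("IP", "255.255.255.256")
cve = apply_rule("Identifier", "CVE-2014-3566")
```

In this toy setting the malformed address is still labeled "IP" by the CRF attribute, but the failed regex attribute shifts most of the belief to "O", which is the noise-filtering behavior the BRB layer is meant to provide.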

Inference of BRB
After modeling the BRB, the input will activate the corresponding rules, and the inference results will be obtained by integrating the activation rules through the Evidential Reasoning (ER) algorithm.
Firstly, if the input of the $m$-th premise attribute is $x_m$ ($m = 1, \ldots, M$), its matching degree with the reference values is calculated from $x_m$ and $\varepsilon_m$, where $\varepsilon_m$ represents the certainty of the input. For example, ($x_m$, $\varepsilon_m = 90\%$) represents that the certainty of $x_m$ is 90%. $\phi(x_m, A_{m,j}^{k})$ ($j \in \{1, 2, \ldots, J_m\}$) denotes the matching degree between the input information $x_m$ and the reference value $A_{m,j}^{k}$, where $J_m$ represents the number of reference values of the $m$-th premise attribute. Since the input of our BRB model is qualitative information and $A$ takes the form of semantic fuzzy values, the matching degree can be obtained directly. For example, if entities in cybersecurity have 10 reference values (10 types), any input must be one of them.
Table 1 Regexes for entities in CSS
Afterward, we calculate the activation weight $w_k$ of each rule, that is, the degree to which the input information activates the rule. The activation weight is computed from $\alpha_m^{k}$, $\theta_k$, and $\delta_m$, where $\alpha_m^{k}$ is the matching degree of the input information relative to the reference value, $\theta_k$ is the initial rule weight, and $\delta_m$ is the initial attribute weight. If $w_k = 0$, the rule is not activated.
Finally, after converting the input information into the matching degree with the reference value and obtaining the activation degree of the corresponding rules, the ER algorithm is used to integrate the activation rules.
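A possible form of the activation-weight computation is sketched below. It assumes the product-based formula commonly used in the BRB literature under the intersection assumption, with attribute weights normalized by their maximum; whether JCLB uses exactly this form is an assumption, and all names are hypothetical.

```python
import numpy as np

def activation_weights(alpha, theta, delta):
    """Activation weight of each rule (assumed standard BRB form).

    alpha: (L, M) matching degrees of the input to each rule's reference values
    theta: (L,)   rule weights
    delta: (M,)   attribute weights
    """
    alpha = np.asarray(alpha, dtype=float)
    theta = np.asarray(theta, dtype=float)
    delta = np.asarray(delta, dtype=float)
    delta_bar = delta / delta.max()                  # normalized attribute weights
    strength = theta * np.prod(alpha ** delta_bar, axis=1)
    total = strength.sum()
    return strength / total if total > 0 else np.zeros_like(strength)

# Two premise attributes (CRF label, regex match) and three rules.
# Only the first rule matches both attributes, so only it is activated.
w_crisp = activation_weights(
    alpha=[[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
    theta=[1.0, 0.8, 0.5],
    delta=[1.0, 1.0],
)

# Uncertain input on the first attribute: 90% vs. 10% matching degree.
w_soft = activation_weights(
    alpha=[[0.9, 1.0], [0.1, 1.0]],
    theta=[1.0, 1.0],
    delta=[1.0, 0.5],
)
```

Because the matching degrees are multiplied, a rule with any fully unmatched premise attribute receives weight 0 and is not activated, consistent with the description above.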

Parameter optimization
According to the calculation process in the previous section, an objective function $f(\omega)$ for optimizing the BRB model parameters can be established, where $\omega$ represents the parameter vector of the BRB, $\{\theta_1, \ldots, \theta_K\}$ represent the weights of the $K$ rules, $\{\beta_{1,1}, \ldots, \beta_{N,K}\}$ represent the confidences of the output conclusions, and $\{A_{1,1}, \ldots, A_{M,K}\}$ represent the reference values of the premise attributes. Let $E_u$ represent the error of a classification result: if the predicted category $\hat{j}$ equals the true category $j$, then $E_u = 0$; otherwise $E_u = 1$. The objective $f(\omega)$ is then described in terms of these errors. To address the challenge of optimization with constraints and high dimensions, we propose the D-CMA-ES algorithm. This algorithm first divides the high-dimensional search space into several subspaces of relatively lower dimension and then applies the CMA-ES algorithm to search in these low-dimensional subspaces. Subsequently, the solutions from each search are integrated to obtain the solution to the original problem. For example, if the dimension of the original space is 4, the algorithm divides the space into two subspaces, each of dimension 2. Each constraint is transformed into a specific unconstrained objective function that is independently optimized in each iteration to ensure that the solution always satisfies the constraints. The D-CMA-ES algorithm is shown as Algorithm 1.
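The decomposition idea can be sketched as follows. This is not Algorithm 1: for brevity, a plain elitist (1+λ) evolution strategy without covariance adaptation is swapped in for CMA-ES as the subspace optimizer, constraint handling is omitted, and the sphere function stands in for the BRB objective. All names are hypothetical.

```python
import numpy as np

def simple_es(f, x0, sigma=0.3, popsize=12, iters=60, rng=None):
    """Elitist (1+lambda) evolution strategy; a simplified stand-in for CMA-ES."""
    if rng is None:
        rng = np.random.default_rng()
    best_x, best_f = x0.copy(), f(x0)
    for _ in range(iters):
        pop = best_x + sigma * rng.normal(size=(popsize, x0.size))
        vals = np.array([f(p) for p in pop])
        i = vals.argmin()
        if vals[i] < best_f:
            best_x, best_f = pop[i].copy(), vals[i]
        else:
            sigma *= 0.9            # shrink the step size after a failed generation
    return best_x, best_f

def distributed_search(f, x0, n_blocks=2, rounds=5, seed=0):
    """Optimize each low-dimensional block in turn, holding the others fixed."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    blocks = np.array_split(np.arange(x.size), n_blocks)
    for _ in range(rounds):
        for idx in blocks:
            def sub_f(z, idx=idx):
                full = x.copy()     # embed the subspace point in the full vector
                full[idx] = z
                return f(full)
            x[idx], _ = simple_es(sub_f, x[idx], rng=rng)
    return x, f(x)

def sphere(v):
    """Placeholder objective (sum of squares), not the BRB objective."""
    return float(np.sum(np.asarray(v) ** 2))

# A 4-dimensional space split into two 2-dimensional subspaces, as in the text.
x0 = [2.0, -1.0, 1.5, -2.0]
x_opt, final_f = distributed_search(sphere, x0, n_blocks=2)
```

Because the subspace optimizer is elitist, the combined solution never gets worse from one block update to the next; a full CMA-ES per block would additionally adapt the search covariance.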

Experiment settings
In this section, we describe the datasets, baseline models, implementation details, and evaluation metrics of experiments.

Datasets
JCLB is evaluated on two datasets as follows. In the labeling task, the BIO scheme is adopted, in which "B" (Begin) identifies the starting position of an entity, "I" (Inside) identifies a token inside an entity, and "O" (Outside) identifies a token that is not in any entity. After data labeling, we divide the cybersecurity data into a 70% training set, a 10% validation set, and a 20% test set to evaluate the proposed method.
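As a small illustration of the BIO scheme, the helper below converts entity spans into token-level tags. The sentence, the span format (start index, exclusive end index, type), and the type names are made up for the example.

```python
def bio_tags(tokens, entities):
    """Convert (start, end_exclusive, type) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"           # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"           # remaining tokens of the entity
    return tags

tokens = ["The", "Mirai", "Botnet", "targeted", "1.1.1.1"]
entities = [(1, 3, "Malware"), (4, 5, "IP")]
tags = bio_tags(tokens, entities)
```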

Implementation details
As the framework includes the hyper-parameters required for model training, this section outlines the main ones. For token embedding, the dimension is set to 768. For sequence encoding, the hidden layers of both the forward and backward LSTMs comprise 300 neurons, and dropout is applied in the BiLSTM feature encoding layer to prevent over-fitting. In the MS module, the size of K and V is 64 and the number of heads is 3. During model training, the number of epochs is set to 100 with a batch size of 128. The model parameters are updated through stochastic gradient descent, with an initial learning rate of 0.001. Table 3 lists the specific values of these hyper-parameters. The hyper-parameters $\alpha$, $\beta$, and $\gamma$ are determined through a grid search, where the step size is 0.1, and the range is

Baseline models
We compare JCLB with six baseline models as follows.

Metrics
To evaluate the performance of the model, we use the common evaluation metrics in information extraction tasks: Precision (P), Recall (R), and F1. P represents the percentage of correct samples among all samples identified by the model, and R represents the percentage of correct samples identified by the model among all actual positive samples. The F1 value is the harmonic mean of precision and recall, which is used to evaluate the comprehensive performance of the model. Each metric is formally expressed as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R},$$
where TP (True Positive) is the number of positive samples that are correctly predicted, FP (False Positive) is the number of negative samples that are wrongly predicted as positive, and FN (False Negative) is the number of positive samples that are wrongly predicted as negative.
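The three metrics can be computed from the counts as in the sketch below. The counts are made-up numbers for illustration, and the zero-denominator guards are a convenience added here rather than part of the definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: 10 entities predicted, 8 of them correct, 16 entities in the gold data.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With these counts, precision is 8/10 and recall is 8/16, so F1 equals 8/13, which lies closer to the lower of the two values, as expected of a harmonic mean.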

Main results
Table 4 presents the performance of various models on the dataset collected by Bridges et al. and on our collected dataset, OpenCS. We report P, R, and F1. JCLB achieves state-of-the-art performance on both datasets, outperforming all previous NER models in cybersecurity with F1 scores of 94.73% and 91.13% on the respective datasets. Notably, compared to the prior best model (Wang and Liu 2023), our approach improves F1 on the OpenCS dataset by an absolute +0.54%. It is worth noting that the previous models are built on different encoders such as LSTM, BERT, and PERT. In summary, our proposed JCLB, which involves contrastive learning for fine-tuning BERT and BRB for noise filtering, represents a substantial advancement over previous models.

Ablation study
In this section, we degenerate JCLB into several models for an ablation study on the two datasets. The models are CRF (C), BiLSTM-CRF (BIC), BERT-CRF (BEC), BiLSTM-MS-CRF (BIMC), BERT-MS-CRF (BEMC), BERT-BiLSTM-CRF (BBC), and BERT-BiLSTM-MS-CRF-BRB (BBMCB). C takes into account the content information of the data and the transition information between labels when modeling, and its related models have achieved good results in many natural language processing tasks. The bidirectional structure in BIC allows each label to be determined from the preceding and following context at the same time (Wu et al. 2019). In BEC, the word at each position can be encoded directly regardless of direction and distance. BERT in BBC can embed tokens through its rich semantic knowledge (Cai et al. 2020). We compare the effect of MS and BRB on the performance of the model, keeping the hyper-parameters unchanged during the training of each model. The experimental results are in Table 5. Figure 4 illustrates the F1 for ten different entities on the OpenCS dataset. We can observe that JCLB surpasses the other models and achieves the highest F1 on the two datasets. Additionally, BIC performs significantly better than C, with an F1 improvement of 3.44%. This is because BIC addresses the issue of long-distance dependence in long-sequence modeling. Moreover, compared to using BiLSTM as a feature extractor, adding the BERT model yields a better F1. Furthermore, when compared with BIC and BEC, BIMC and BEMC show improvements of 0.20% and 0.83% in F1, respectively, since the MS layer is employed to capture the dependency weight of feature coding between any two tokens.

Analysis on contrastive learning
We present a comparative analysis of variants of our contrastive learning method, detailing their test performance on the OpenCS dataset in Table 6.We observe an F1 decline across all these variants.We speculate that when contrastive learning focuses solely on the initial and final tokens of an entity, neglecting its span-based token sequence, it might result in the loss of semantic information contained within the intermediate tokens.In certain scenarios, these middle tokens could provide key insights into how the entity interacts with the context.Ignoring these tokens could diminish the model's understanding of the entity's meaning.Conversely, when contrastive learning focuses solely on span-based token sequences of entities, ignoring the initial and final tokens, the model overlooks the semantic information of entity boundaries carried by these tokens, leading to an inadequate understanding of the entity's integrity.In Fig. 5, we visualize the average similarity of token sequence representations for in-batch positives and negatives sampled in span-based and position-based contrastive learning.For position-based contrastive learning, we show the mean similarity between the initial and final tokens.We observe that, as training progresses, the similarity for in-batch negatives rapidly decreases, indicating that in-batch negatives provide limited gradient signals.In contrast, the similarity for in-batch positives remains high and is distinctly separated from the negatives, suggesting that our method effectively enhances the similarity of token sequence representations for the same type of entities in the vector space.

Analysis on BRB
We evaluate the impact of using BRB on the OpenCS dataset. The experimental results are shown in Table 7, where "✓" indicates that the BRB is used and "×" indicates that it is not. With BRB, BBMCB achieves an F1 of 89.66%, compared with 85.33% without it. BEMC achieves an F1 of 85.20% with BRB, and removing BRB leads to a 3.59% decrease in F1. Similarly, JCLB achieves an F1 of 91.13% with BRB, and a 2.19% decrease in F1 occurs without it. These experiments reveal BRB's notable recognition capability, particularly for entities with fixed formats.
For comparison with D-CMA-ES, we also employ two other constrained optimization algorithms, Sequential Quadratic Programming (SQP) and Differential Evolution (DE), to optimize the BRB in JCLB. SQP is a traditional yet effective optimization algorithm that addresses constrained problems by solving a sequence of quadratic programming subproblems. DE is an intelligent evolutionary algorithm that solves constrained problems through specialized operations. Figure 6 displays the results of these optimization algorithms, revealing that the model attains optimal recognition performance for each category when utilizing D-CMA-ES.

Case study
As seen in Table 8, we take the proposed JCLB and the BBC model as examples to analyze the recognition results of entities in CSS that are expressed as abbreviations, proprietary nouns, or fixed-format phrases without semantics. In the second line, we list some abbreviations such as "POODLE" and "BGP". "POODLE" is the abbreviation of "Padding Oracle On Downgraded Legacy Encryption" and "BGP" is the abbreviation of "Border Gateway Protocol". The highlighted characters in Table 8 indicate that "BGP" is correctly recognized as an entity, but BBC does not correctly identify "POODLE". This is because the BBMCB model uses an MS module, which encodes the input words while attending to other words in the context, so that the labels remain consistent. In the third line, "Eternal Blue" is a network attack tool that BBC does not correctly identify. For noisy cybersecurity entities in the text, as seen in the fourth line, BBC does not correctly identify the IP address "1.1.1.1", but it identifies the malformed "255.255.255.256" as a subnet mask. BBC cannot identify the noise in the text.

Conclusion
In this paper, we propose JCLB, a novel model for NER in cybersecurity. JCLB employs contrastive learning to establish objectives based on span and position, thereby fine-tuning BERT. This method enhances the similarity of token sequence representations for the same type of entities in vector space, reducing the impact of anisotropy on encoding quality. We also demonstrate the feasibility of applying BRB to filter noise and its advantages in improving the recognition of fixed-format entities. When optimizing BRB parameters, we propose the D-CMA-ES algorithm, which, compared with the CMA-ES algorithm, adaptively divides samples into multiple subspaces for sampling, effectively avoiding the negative impact of

Fig. 3 Structure of the BRB

Fig. 4 Performance of recognition for ten categories of cybersecurity entities achieved by various models

Fig. 5 a Variation for similarity of in-batch positive and negative pairs in span-based contrastive learning. b Variation for similarity of in-batch positive and negative pairs in position-based contrastive learning

Fig. 6 The performance of JCLB with different optimization algorithms for BRB

Table 2 Size of each type of entities

Table 3 Hyper-parameters in the JCLB

• Gao et al. (2021) introduce a data and knowledge-driven NER model for cybersecurity. The input layer incorporates an external dictionary as an auxiliary knowledge database to enhance word representation.
• Wu et al. (2022) utilize BiLSTM, CNN, and CRF for NER. Specifically, they employ a linear stack of LSTM and CNN in the deep neural network layer for a more efficient global and local feature representation.
• Wang and Liu (2023) develop a graph RNN, GARU, integrating diverse features extracted from GNNs and RNNs. Additionally, they introduce an entity boundary detection module for predicting entity heads and tails.

Table 4 Comparison of different models on two datasets

Table 5 Ablation study on two datasets. F1 scores in bold indicate the best results

Table 6 Ablation study on contrastive learning

Table 7 Ablation study on BRB. F1 scores in bold indicate better results obtained w/ or w/o BRB