
Zhihao Peng, Liuxin Bao, Shengyuan Liu, Yixuan Yuan
Chinese University of Hong Kong
Corresponding author (yxyuan@ee.cuhk.edu.hk)
Abstract

The collaborativeness of large language models (LLMs) has proven effective in natural language processing systems and holds considerable promise for healthcare development. However, it lacks explicit component selection rules, necessitating human intervention or clinical-specific validation. Moreover, existing architectures rely heavily on a predefined LLM cluster, in which some LLMs underperform in medical decision support scenarios, invalidating the collaborativeness of LLMs. To this end, we propose an adaptive cluster collaborativeness methodology involving self-diversity and cross-consistency maximization mechanisms to boost LLMs' medical decision support capacity. For self-diversity, we calculate the fuzzy matching value of pairwise outputs within an LLM as its self-diversity value, and then prioritize LLMs with high self-diversity values as cluster components in a training-free manner. For cross-consistency, we first measure cross-consistency values between the LLM with the highest self-diversity value and the others, and then gradually mask out the LLM with the lowest cross-consistency value to eliminate potentially inconsistent outputs during collaborative propagation. Extensive experiments on two specialized medical datasets, NEJMQA and MMLU-Pro-health, demonstrate the effectiveness of our method across physician-oriented specialties. For example, on NEJMQA, our method reaches the official passing score in all disciplines, notably achieving an accuracy (ACC) of 65.47% on the ‘Obstetrics and Gynecology’ discipline compared with 56.12% for GPT-4.

1 Introduction

Over the past decades, considerable efforts have been made in developing traditional machine learning approaches and deep learning-based models, enhancing the accuracy and accessibility of medical decision support systems. Nevertheless, a substantial gap remains between the development of major medical decision support algorithms and their clinical deployment in the healthcare domain, as they fail to reach a physician-like level in specific specialties. Recently, the emergence of large language models (LLMs) has substantially advanced the natural language processing domain. Such rapid advancement of LLMs ouyang2022training ; achiam2023gpt ; chen2024more ; liang2024can ; yuksekgonul2025optimizing holds considerable promise for penetrating from general to domain-specific fields, with particularly strong interest in healthcare applications thirunavukarasu2023large ; xu2023knowledge ; li2024mediq ; katz2024gpt ; chen2025map . A key enabler of this advancement is the collaborativeness of LLMs wang2025mixtureofagents , an inherent phenomenon in which multiple LLMs tend to generate higher-quality outputs through referenced interactions. Various approaches leveraging this capability have demonstrated substantial improvements in natural language understanding and generation li2023camel ; liang2024encouraging ; chan2024chateval ; zhang-etal-2024-exploring ; estornell2024multi ; feng2024don . For instance, Du et al. du2023improving encourage multiple LLMs to iteratively propose and debate their individual outputs through multi-round discussions to reach a consensus. Wang et al. wang2025mixtureofagents surpass GPT-4 Omni achiam2023gpt via iterative aggregation of outputs, with each layer selecting the inputs from the previous layer through prompt engineering. Li et al. li2025rethinking aggregate multiple outputs from a single best-performing LLM during iterative aggregation to enhance the inference performance.
However, these models exhibit component-wise uncertainty due to the absence of explicit selection criteria for cluster components. Furthermore, most existing models heavily rely on a predefined architecture, where some LLMs may introduce medical misinformation into collaborative propagation, ultimately compromising system performance. Nevertheless, few studies have focused on evaluating the collaborativeness of LLMs concerning the physician-level medical decision support capacity, yet improving its accessibility and accuracy can significantly reduce medical decision errors and optimize treatment pathways. It is worth noting that healthcare stands to benefit significantly from advances in the collaborativeness of LLMs, and such technology complements rather than replaces physicians, particularly in resource-limited settings where reliable physicians across a specific specialty are scarce keeler2006reducing ; shen2021artificial ; dvijotham2023enhancing ; li2024mediq ; kim2024mdagents .

Figure 1: Comparisons on (a) NEJMQA and (b) MMLU-Pro-health demonstrate our substantial performance improvements across diverse disciplines in medical decision support scenarios.
Figure 2: For LLMs of equal parameter size, a higher SD value correlates with better performance in medical decision support tasks. 11 eligible cases (11 of 12) are highlighted with a green dashed box.

In our preliminary study, we empirically find that existing models leveraging the collaborativeness of LLMs underperform in medical decision support scenarios, often yielding inferior results compared to single LLMs, as illustrated in Figure 1. This performance degradation may stem from some LLMs exhibiting over-confidence wen2024mitigating in their incorrect outputs, thereby propagating medical misinformation that compromises the collaborativeness of LLMs.

To this end, we propose an adaptive cluster collaborativeness methodology involving self-diversity (SD) and cross-consistency (CC) maximization mechanisms to enhance LLMs' medical decision support capacity. Specifically, we first propose an SD maximization mechanism that selects LLMs with high output diversity as cluster members, since we observe that LLMs generating more diverse outputs tend to achieve better performance. Figure 2 shows that eleven of twelve LLMs (highlighted with a green dashed box) follow the pattern in which higher SD values correlate with higher accuracies. The exception is Llama3-Instruct-70B (highlighted with a red dashed box), which is potentially due to its training on the output format. We then measure pairwise CC values between the LLM with the highest SD value and the others for the subsequent mask operation. Afterward, we iteratively exclude the LLM with the lowest pairwise CC value and propagate the remaining outputs to the next layer. In this way, we mask LLMs layer by layer, where each LLM generates its output by integrating all outputs from the previous layer as auxiliary context. Experiments on two specialized medical datasets, NEJMQA and MMLU-Pro-health, demonstrate substantial improvements with our method, indicating physician-level medical decision support capacity. Specifically, on NEJMQA, which is derived from Israel's 2022 medical specialist licensing examination, our method reaches the passing score (i.e., 65%) in all disciplines: General Surgery, Internal Medicine, Obstetrics and Gynecology, Pediatrics, and Psychiatry. In particular, our method achieves an ACC of 65.47% on the ‘Obstetrics and Gynecology’ discipline compared to the previous best of 56.12% achieved by GPT-4.

The contributions of this work are summarized as follows: (i) We find that the collaborativeness of LLMs tends to be invalidated in medical decision support scenarios, both because some LLMs lack sufficient medical data for training or fine-tuning and because underperforming LLMs may introduce medical errors into the collaborative interaction, resulting in ambiguous and unreliable results. (ii) We propose the SD maximization mechanism based on the empirical observation that a single LLM with more diverse outputs tends to achieve better performance, selecting LLMs with high diversity values as cluster members to construct the LLM cluster. (iii) We propose the CC maximization mechanism to iteratively mask LLMs layer by layer, achieving adaptive collaborativeness and effectively avoiding performance degradation caused by the underperformance of individual LLMs. (iv) Empirical evaluations conducted on two specialized medical datasets, NEJMQA and MMLU-Pro-health, demonstrate our substantial performance improvements in medical decision support scenarios. For instance, on NEJMQA, our method reaches the passing threshold of 65% in all disciplines, notably attaining an accuracy of 65.47% in the ‘Obstetrics and Gynecology’ discipline, compared to the second-best result of 56.12% achieved by GPT-4.

Figure 3: Illustration of the proposed adaptive cluster collaborativeness. We first measure pairwise cross-consistency values between the LLM with the highest SD value and other models. Then, we iteratively mask the LLM showing the lowest pairwise CC value in the current layer and propagate only the outputs from remaining LLMs to the next layer. This adaptive mask mechanism significantly reduces the inconsistency of concatenated outputs while ensuring each LLM generates outputs based exclusively on outputs of screened LLMs from the previous layer as a contextual reference rather than considering entire models.

2 Related Work

2.1 LLM Reasoning

In recent years, LLMs have exhibited increasingly remarkable performance across a wide range of mathematical, scientific, and programming benchmarks wei2022chain ; zhou2022least ; yao2023tree ; besta2024graph . This progress is primarily attributed to the emergence of reasoning techniques, which have become pivotal methods for enhancing the inferential capabilities of LLMs. Chain-of-Thought (CoT) addresses complex problems by guiding the model to generate a sequence of intermediate reasoning steps wei2022chain . Least-to-Most Prompting (LtM) decomposes a task into a series of subproblems solved in order, where the solution to each subproblem supports subsequent ones zhou2022least . Tree-of-Thought (ToT) employs a tree structure that enables the model to explore multiple reasoning paths in parallel yao2023tree . Skeleton-of-Thought (SoT) improves generation efficiency by first producing an outline of the output, then filling in details in parallel ning2023skeleton . Graph-of-Thought (GoT) offers a more dynamic reasoning paradigm by modeling the reasoning process as a graph of interconnected thought nodes besta2024graph .

2.2 Collaborativeness of LLMs

Recent studies have demonstrated that the collaborativeness of LLMs can effectively integrate their respective strengths, thereby enhancing the ability to solve complex problems li2025rethinking ; kim2024mdagents ; wang2025mixtureofagents . Existing frameworks can be broadly classified into two categories. The first is commonly referred to as role-playing. In this paradigm, multiple LLMs are assigned distinct roles or responsibilities, with each model focusing on tasks specific to its designated function kim2024mdagents ; wang2024survey . Through collaborative interactions, the LLMs work together to achieve complex overarching objectives. With a clear division of labor, this approach enables the effective decomposition of complex problems and leverages the specialized competencies of each LLM to generate integrated and comprehensive solutions. The second is referred to as multi-LLM debate du2023improving . In this paradigm, each LLM first attempts to solve the problem independently and then analyzes the outputs of other LLMs to reach a consensus. Within this framework, existing works can be further delineated according to the composition and interaction strategies of the participating LLMs.

From the perspective of LLM composition, existing works can be classified into two main categories: debates involving multiple instances of a single LLM li2025rethinking and debates among heterogeneous LLMs wang2025mixtureofagents ; wang2024rethinking . In terms of deliberation mechanisms, representative strategies include majority voting schemes wang2022self , interdisciplinary collaborativeness paradigms tang2023medagents , structured group discussions chen2023reconcile , and negotiation-based protocols fu2023improving . Each of these approaches offers distinct advantages in facilitating consensus formation and improving the robustness of the solution.

3 Adaptive Cluster Collaborativeness Methodology

This section introduces the proposed adaptive cluster collaborativeness methodology, which involves the SD maximization mechanism for cluster construction and the CC maximization mechanism for adaptive collaborativeness, as illustrated in Figure 3.

3.1 Cluster Construction of LLMs

As mentioned above, the collaborativeness of LLMs exhibits component-wise uncertainty because its cluster components lack explicit selection rules, creating significant barriers to practical healthcare applications. Additionally, existing architectures wang2025mixtureofagents ; li2025rethinking rely heavily on a predefined LLM cluster with model sizes reaching 141B parameters, which severely limits real-world healthcare deployment due to excessive computational resource requirements.

To this end, we propose an SD maximization mechanism that selects LLMs exhibiting a high diversity value within the scope of accessible resources to achieve adaptive cluster construction. This mechanism is motivated by the empirical observation that LLMs generating more diverse outputs tend to achieve better performance, as illustrated in Figure 2. Accordingly, we select LLMs exhibiting a high diversity value from the candidate models as cluster components, as detailed below.

We first employ a fast string matching algorithm max_bachmann_2024_10938887 to calculate the output diversity of LLMs since it is useful for detecting partial matches in string data. Specifically, we sample 10 outputs from a single LLM $\mathbf{L}_{\mathbf{I}}$ to the same question, denoted as $\{\mathbf{O}_{\mathbf{I}}^{j}\}_{j=1}^{10}$. For any given pair of outputs, taking $\mathbf{O}_{\mathbf{I}}^{1}$ and $\mathbf{O}_{\mathbf{I}}^{2}$ ($|\mathbf{O}_{\mathbf{I}}^{1}| \leq |\mathbf{O}_{\mathbf{I}}^{2}|$) as an example, we compute their similarity by finding the best matching substring of $\mathbf{O}_{\mathbf{I}}^{2}$ that aligns with $\mathbf{O}_{\mathbf{I}}^{1}$. For each position $i \in \mathbf{O}_{\mathbf{I}}^{2}$, the substring is obtained as follows:

$$\mathbf{O}_{\mathbf{I}}^{sub} = \mathbf{O}_{\mathbf{I}}^{2}\left[i : i + |\mathbf{O}_{\mathbf{I}}^{1}|\right], \quad \mathrm{s.t.} \quad i \in \{0, \dots, |\mathbf{O}_{\mathbf{I}}^{2}| - |\mathbf{O}_{\mathbf{I}}^{1}|\}, \tag{1}$$

where $\mathbf{O}_{\mathbf{I}}^{1}$ slides over $\mathbf{O}_{\mathbf{I}}^{2}$ with a window of size $|\mathbf{O}_{\mathbf{I}}^{1}|$. The similarity value of each window can be computed using the Levenshtein distance levenshtein1966binary , defined as

$$\mathit{sim}(\mathbf{O}_{\mathbf{I}}^{1}, \mathbf{O}_{\mathbf{I}}^{sub}) = \left(1 - \frac{D(\mathbf{O}_{\mathbf{I}}^{1}, \mathbf{O}_{\mathbf{I}}^{sub})}{|\mathbf{O}_{\mathbf{I}}^{1}|}\right) \times 100, \tag{2}$$

where $D(\mathbf{O}_{\mathbf{I}}^{1}, \mathbf{O}_{\mathbf{I}}^{sub})$ denotes the Levenshtein distance between $\mathbf{O}_{\mathbf{I}}^{1}$ and $\mathbf{O}_{\mathbf{I}}^{sub}$. Afterward, the output diversity of $\mathbf{O}_{\mathbf{I}}^{1}$ and $\mathbf{O}_{\mathbf{I}}^{2}$, termed the SD value, can be computed as:

$$\mathit{div}(\mathbf{O}_{\mathbf{I}}^{1}, \mathbf{O}_{\mathbf{I}}^{2}) = 100 - \max\left(\mathit{sim}(\mathbf{O}_{\mathbf{I}}^{1}, \mathbf{O}_{\mathbf{I}}^{sub})\right). \tag{3}$$
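As a small worked instance of Eqs. (1)-(3), consider two short strings chosen purely for illustration (they are assumptions and do not come from the datasets). The shorter string has length 3 and the longer has length 5, so the window start index takes the values 0, 1, and 2.

```latex
% Illustrative strings (assumed): O^1 = "abc", O^2 = "xabdx"
\begin{align*}
i = 0:&\quad \mathbf{O}_{\mathbf{I}}^{sub} = \texttt{xab}, \quad D = 2, \quad \mathit{sim} = \left(1 - \tfrac{2}{3}\right) \times 100 \approx 33.33 \\
i = 1:&\quad \mathbf{O}_{\mathbf{I}}^{sub} = \texttt{abd}, \quad D = 1, \quad \mathit{sim} = \left(1 - \tfrac{1}{3}\right) \times 100 \approx 66.67 \\
i = 2:&\quad \mathbf{O}_{\mathbf{I}}^{sub} = \texttt{bdx}, \quad D = 3, \quad \mathit{sim} = \left(1 - \tfrac{3}{3}\right) \times 100 = 0 \\
&\quad \mathit{div}(\mathbf{O}_{\mathbf{I}}^{1}, \mathbf{O}_{\mathbf{I}}^{2}) = 100 - \max(33.33,\ 66.67,\ 0) \approx 33.33
\end{align*}
```

The best-aligned window (here at $i = 1$) determines the similarity, so a pair that shares a long common fragment receives a low diversity value even when the surrounding text differs.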

Similarly, we can obtain the SD values for all other pairwise outputs in $\{\mathbf{O}_{\mathbf{I}}^{j}\}_{j=1}^{10}$, resulting in a total of 45 SD values (i.e., $C_{10}^{2}$). Finally, we take the mean of these SD values as the final SD value for the LLM $\mathbf{L}_{\mathbf{I}}$; the higher the SD value of an LLM, the more diverse its outputs.
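The SD computation described above can be sketched in plain Python. This is a minimal re-implementation of Eqs. (1)-(3) under stated assumptions, not the cited string-matching library and not the authors' code: the function names (`levenshtein`, `div`, `sd_value`, `select_cluster`), the handling of empty outputs, and the top-`k` selection parameter are all hypothetical choices made for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance D(a, b) via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def div(o1: str, o2: str) -> float:
    """Pairwise diversity of Eq. (3): 100 minus the best window similarity."""
    if len(o1) > len(o2):            # enforce |o1| <= |o2|
        o1, o2 = o2, o1
    if not o1:                       # assumption: the equations leave empty outputs undefined
        return 0.0 if not o2 else 100.0
    best = 0.0
    for i in range(len(o2) - len(o1) + 1):                    # Eq. (1): sliding window
        sub = o2[i:i + len(o1)]
        sim = (1 - levenshtein(o1, sub) / len(o1)) * 100      # Eq. (2)
        best = max(best, sim)
    return 100.0 - best


def sd_value(outputs: list[str]) -> float:
    """Mean of div over all unordered pairs (45 pairs for 10 sampled outputs)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compute an SD value")
    return sum(div(a, b) for a, b in pairs) / len(pairs)


def select_cluster(samples: dict[str, list[str]], k: int) -> list[str]:
    """Rank candidate LLMs by SD value and keep the top k (k is a hypothetical knob)."""
    ranked = sorted(samples, key=lambda name: sd_value(samples[name]), reverse=True)
    return ranked[:k]
```

Here `samples` maps a model name to its sampled outputs for the same question. The quadratic-time edit distance is adequate for a sketch, although long outputs would likely call for an optimized implementation.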

3.2 Adaptive Collaborativeness of Cluster

Previous models achieve the collaborativeness of LLMs through the iterative aggregation of entire outputs, where the current layer aggregates the outputs of all LLMs in the previous layer, inevitably leading to interference from low-quality redundant outputs and substantial time consumption.

To mitigate this issue, we use a CC maximization mechanism to iteratively mask the LLM with the lowest pairwise CC value layer by layer, allowing adjustable aggregation by setting the number of masked LLMs. The implementation involves three key steps: (1) measuring the pairwise CC value between the LLM with the highest SD value and other LLMs; (2) masking the LLM with the lowest pairwise cross-consistency value iteratively; (3) propagating the outputs of remaining LLMs, where each LLM within the current layer generates its output by integrating outputs of screened LLMs within the previous layer as auxiliary context.

The illustration is given in Figure 3 and its mathematical definition is as follows. Let $\mathbf{L}_{\mathbf{I}}, \mathbf{L}_{\mathbf{II}}, \mathbf{L}_{\mathbf{III}}, \mathbf{L}_{\mathbf{IV}}, \mathbf{L}_{\mathbf{V}}, \mathbf{L}_{\mathbf{VI}}$ be the cluster of LLMs, where $\mathbf{L}_{\mathbf{I}}$ is the LLM with the highest SD value. First, we obtain the inferred output $\mathbf{r}_{1}$ of the first layer by

$$\mathbf{r}_{1} = \bigoplus\left(\sum_{j \in \mathbf{L}_{clu}} \mathbf{L}_{j}(\mathbf{r}_{0}) + \mathbf{x}_{0}\right), \quad \mathrm{s.t.} \quad \mathbf{r}_{0} = \mathbf{x}_{0}, \quad \mathbf{O}_{j}^{1} = \mathbf{L}_{j}(\mathbf{r}_{0}), \tag{4}$$

where $\mathbf{L}_{clu} = \{\mathbf{I}, \mathbf{II}, \mathbf{III}, \mathbf{IV}, \mathbf{V}, \mathbf{VI}\}$ denotes the cluster indices of LLMs, $\mathbf{x}_{0}$ denotes the input information, $+$ and $\sum$ denote the concatenation of outputs, $\mathbf{L}_{j}(\mathbf{r}_{0})$ denotes the output of LLM $\mathbf{L}_{j}$ with $\mathbf{r}_{0}$ as the input, and $\bigoplus(\cdot)$ denotes the application of the aggregation prompt. For readability, we subsequently abbreviate the outputs $\mathbf{O}_{\mathbf{I}}^{1}, \mathbf{O}_{\mathbf{II}}^{1}, \mathbf{O}_{\mathbf{III}}^{1}, \mathbf{O}_{\mathbf{IV}}^{1}, \mathbf{O}_{\mathbf{V}}^{1}, \mathbf{O}_{\mathbf{VI}}^{1}$ as $\mathbf{O}_{\mathbf{I}}, \mathbf{O}_{\mathbf{II}}, \mathbf{O}_{\mathbf{III}}, \mathbf{O}_{\mathbf{IV}}, \mathbf{O}_{\mathbf{V}}, \mathbf{O}_{\mathbf{VI}}$. Afterward, we measure the pairwise CC values between $\mathbf{O}_{\mathbf{I}}$ and $\mathbf{O}_{\mathbf{II}}, \mathbf{O}_{\mathbf{III}}, \mathbf{O}_{\mathbf{IV}}, \mathbf{O}_{\mathbf{V}}, \mathbf{O}_{\mathbf{VI}}$ via Eq. (3) to obtain the index $\mathbf{c}$ with the lowest pairwise CC value by

$$\mathop{\arg\min}_{\mathbf{c} \in \{\mathbf{II}, \mathbf{III}, \mathbf{IV}, \mathbf{V}, \mathbf{VI}\}} \mathit{div}(\mathbf{O}_{\mathbf{I}}, \mathbf{O}_{\mathbf{c}}). \tag{5}$$

Thus, we can mask the LLM with the index $\mathbf{c}$ in the $i$-th layer, which can be formulated as:

$$\mathbf{r}_{i} = \bigoplus\left(\sum_{j \in \mathbf{L}_{clu} \setminus \{\mathbf{c}\}} \mathbf{L}_{j}(\mathbf{r}_{i-1}) + \mathbf{x}_{0}\right), \tag{6}$$

where $\sum_{j \in \mathbf{L}_{clu} \setminus \{\mathbf{c}\}}$ denotes the concatenation of outputs except that of the LLM with the index $\mathbf{c}$. Finally, we can directly obtain the final result $\mathbf{r}_{l}$ with respect to the question $\mathbf{x}_{0}$, where $l$ is the number of layers. The inference process of our method is summarized in Alg. 1.

Algorithm 1 Adaptive Cluster Collaborativeness Methodology
Require: Input data $\mathbf{x}_{0}$; LLM cluster indices $\mathbf{L}_{clu}=\{\mathbf{I},\mathbf{II},\mathbf{III},\mathbf{IV},\mathbf{V},\mathbf{VI}\}$;
Ensure: Final result $\mathbf{r}_{l}$;
1: Initialization: number of network layers $l=4$, $i=1$;
2: Obtain the corresponding outputs $\mathbf{O}_{\mathbf{I}},\mathbf{O}_{\mathbf{II}},\mathbf{O}_{\mathbf{III}},\mathbf{O}_{\mathbf{IV}},\mathbf{O}_{\mathbf{V}},\mathbf{O}_{\mathbf{VI}}$ of $\mathbf{L}_{clu}$;
3: Obtain the inferred output by Eq. (4);
4: while $i<l$ do
5:     Obtain the minimum pairwise cross-consistency index $\mathbf{c}$ by Eq. (5);
6:     Mask the LLM with the index $\mathbf{c}$;
7:     Obtain the inferred output by Eq. (6);
8:     Update $\mathbf{L}_{clu}$;
9:     $i=i+1$;
10: end while
11: Obtain the final result;
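The control flow of Alg. 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: each LLM is modeled as a hypothetical callable from prompt text to answer text, the aggregation of Eqs. (4) and (6) is replaced by plain concatenation, and `word_overlap` is a toy placeholder for the cross-consistency measure of Eq. (3).

```python
from typing import Callable, Dict, List, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt text in, answer text out


def word_overlap(a: str, b: str) -> float:
    """Toy cross-consistency: Jaccard overlap of word sets (placeholder for Eq. (3))."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


def adaptive_inference(
    x0: str,
    llms: Dict[str, LLM],
    reference: str,
    num_layers: int = 4,
    cc: Callable[[str, str], float] = word_overlap,
) -> Tuple[str, List[str]]:
    """Follow the control flow of Alg. 1; returns (final text, masked indices in order)."""
    active = list(llms)                                   # L_clu
    masked: List[str] = []
    outputs = {k: llms[k](x0) for k in active}            # step 2
    r = "\n".join(outputs[k] for k in active)             # step 3, stand-in for Eq. (4)
    i = 1
    while i < num_layers and len(active) > 1:             # step 4
        c = min(
            (k for k in active if k != reference),
            key=lambda k: cc(outputs[reference], outputs[k]),
        )                                                 # step 5, Eq. (5)
        active.remove(c)                                  # steps 6 and 8
        masked.append(c)
        prompt = f"{r}\n{x0}"                             # previous result plus question
        outputs = {k: llms[k](prompt) for k in active}    # step 7, stand-in for Eq. (6)
        r = "\n".join(outputs[k] for k in active)
        i += 1
    return r, masked


# Toy LLMs that ignore the prompt; invented for illustration only.
toy_llms: Dict[str, LLM] = {
    "I": lambda prompt: "answer B",
    "II": lambda prompt: "answer B because of fever",
    "III": lambda prompt: "unsure",
}
final, masked = adaptive_inference("Which option is correct?", toy_llms, reference="I", num_layers=3)
print(masked)
```

The sketch masks one LLM per iteration, as written in Alg. 1, and stops early once only the reference LLM remains.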

4 Evaluation

4.1 Experimental Setting

Datasets. To evaluate the medical decision support capacity of LLMs, we employ two publicly available medical datasets: NEJMQA [9] and MMLU-Pro-health [38]. NEJMQA is derived from Israel’s 2022 medical specialist licensing examination and covers five core clinical disciplines: General Surgery, Internal Medicine, Obstetrics and Gynecology, Pediatrics, and Psychiatry. The dataset comprises 655 single-choice and multiple-choice questions across these disciplines. Notably, physicians must achieve a minimum passing score of 65% in each discipline to obtain board certification. We adopt this threshold as the benchmark for assessing whether LLMs demonstrate physician-level medical decision support capacity. MMLU-Pro-health, the health subset of MMLU-Pro, contains 818 carefully curated questions spanning eight medical specialties: Virology, Professional Medicine, Nutrition, Medical Genetics, Human Aging, College Medicine, Clinical Knowledge, and Anatomy. Each question underwent rigorous processing, including initial filtering, integration, option augmentation, and expert review, to enhance reasoning complexity and ensure precise healthcare evaluation. The detailed statistics are presented in Table 1, with the prompt templates provided in Appendix Tables A1 and A2.

Table 1: Statistics of the adopted specialized medical datasets. NEJMQA comprises the physician-oriented items from Israel’s 2022 medical specialist license examination, covering five clinical disciplines with both single-choice and multiple-choice questions. MMLU-Pro-health, on the other hand, contains more challenging and reasoning-focused questions across eight medical disciplines with an expanded answer option set of ten choices per question.
| Dataset | Number of questions | Options | Type | Discipline distribution (number of questions) |
|---|---|---|---|---|
| NEJMQA | 655 | A-D | multiple | General Surgery (141), Internal Medicine (126), Obstetrics and Gynecology (139), Pediatrics (99), Psychiatry (150) |
| MMLU-Pro-health | 818 | A-J | single | Virology (46), Professional Medicine (254), Nutrition (179), Medical Genetics (54), Human Aging (86), College Medicine (48), Clinical Knowledge (72), Anatomy (79) |
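To make the 65% passing threshold concrete, the following sketch checks that the per-discipline counts in Table 1 add up to the dataset totals and derives how many NEJMQA questions must be answered correctly in each discipline. It assumes that the score is simply the share of correctly answered questions with equal weight per question, which the text does not state explicitly; `min_correct` is a helper name introduced here.

```python
NEJMQA = {
    "General Surgery": 141,
    "Internal Medicine": 126,
    "Obstetrics and Gynecology": 139,
    "Pediatrics": 99,
    "Psychiatry": 150,
}
MMLU_PRO_HEALTH = {
    "Virology": 46,
    "Professional Medicine": 254,
    "Nutrition": 179,
    "Medical Genetics": 54,
    "Human Aging": 86,
    "College Medicine": 48,
    "Clinical Knowledge": 72,
    "Anatomy": 79,
}
PASS_PERCENT = 65  # passing score per discipline, as stated in the text


def min_correct(n_questions: int, pass_percent: int = PASS_PERCENT) -> int:
    """Smallest number of correct answers whose share is at least pass_percent."""
    return -(-n_questions * pass_percent // 100)  # integer ceiling, no float rounding


# The discipline counts reproduce the dataset sizes reported in Table 1.
assert sum(NEJMQA.values()) == 655
assert sum(MMLU_PRO_HEALTH.values()) == 818

needed = {discipline: min_correct(n) for discipline, n in NEJMQA.items()}
print(needed)
```

Under this assumption, passing Pediatrics requires 65 of 99 correct answers, while Psychiatry requires 98 of 150.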

Models. To assess the performance advantages of our method, we compare it with twelve open-access LLMs (phi4 14B [39], qwen2.5 14B and qwen2.5 32B [40], qwq 32B [41], openthinker 32B [42], deepseek-r1 32B [43], llama3 instruct 70B [44], Qwen1.5 Chat 72B and Qwen1.5 Chat 110B [45], dbrx-instruct 132B [46], Mixtral 8x22 141B [47], and WizardLM 8x22 141B [48]), two closed-source LLMs (GPT-4 and GPT-4o-mini [2]), and three SOTA models (Debate [18], MoA [11], and SelfMoA [19]).

Implementation Details. To achieve competitive performance while keeping inference costs low, our model exclusively employs open-access LLMs with 14B to 32B parameters. A single 32B-parameter model requires 21,735 MB of GPU memory, which a single NVIDIA GeForce RTX 4090 can provide, making the configuration both practical and cost-effective. The cluster of LLMs is selected based on their SD values and comprises phi4 14B, qwen2.5 14B, qwen2.5 32B, qwq 32B, openthinker 32B, and deepseek-r1 32B. For fair comparison, we follow the prompt template of [11] for aggregating the LLM outputs, which is given in Appendix Table A3. We mask two LLMs in each layer until only one LLM remains to produce the final inference. We test the open-access LLMs through the Ollama platform and the closed-source LLMs through the OpenAI API. The model is implemented with PyTorch on an NVIDIA GeForce RTX 4090. We strictly adhere to the licensing terms of all models used in this research.
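One possible reading of this masking schedule is sketched below: starting from the six-LLM cluster and removing two LLMs per layer gives cluster sizes 6, 4, 2, and 1. The helper `mask_schedule` is hypothetical, and the rule that the last step keeps exactly one LLM when fewer than two can be removed is an assumption of this sketch.

```python
from typing import List


def mask_schedule(cluster_size: int, masks_per_layer: int = 2) -> List[int]:
    """Number of active LLMs per layer, shrinking until a single LLM remains."""
    if cluster_size < 1 or masks_per_layer < 1:
        raise ValueError("cluster_size and masks_per_layer must be positive")
    sizes = [cluster_size]
    while sizes[-1] > 1:
        # Never drop below one LLM, which produces the final inference.
        sizes.append(max(1, sizes[-1] - masks_per_layer))
    return sizes


schedule = mask_schedule(6)
print(schedule)  # [6, 4, 2, 1]
```

Under this reading the schedule has four entries, which agrees with the four network layers used in Alg. 1.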

Metrics. To comprehensively evaluate the compared models and our method, we use seven evaluation metrics: accuracy (ACC), weighted F1-score (F1), precision (PRE) [49], sensitivity (SEN) [50], specificity (SPE) [51], Matthews correlation coefficient (MCC) [52], and Cohen’s kappa (CK) [53].
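For reference, the seven metrics can be computed from gold and predicted option letters as sketched below. The averaging scheme is an assumption: the text only states that F1 is weighted, so this sketch uses support-weighted averages for PRE, SEN, and SPE as well. That choice is at least consistent with Tables 2 and 3, where the ACC and SEN columns are identical, since support-weighted sensitivity always equals accuracy. MCC follows the standard multi-class generalization and CK the usual chance-corrected agreement; the function name and the example labels are invented.

```python
from math import sqrt
from typing import Dict, Sequence


def weighted_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Support-weighted multi-class metrics from gold and predicted option letters."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    s = len(y_true)
    t = {k: sum(y == k for y in y_true) for k in labels}   # gold count per class
    p = {k: sum(y == k for y in y_pred) for k in labels}   # predicted count per class
    tp = {k: sum(a == b == k for a, b in zip(y_true, y_pred)) for k in labels}
    c = sum(tp.values())                                    # correctly answered items

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    pre = sen = spe = f1 = 0.0
    for k in labels:
        w = t[k] / s                                        # class support weight
        prec_k, rec_k = ratio(tp[k], p[k]), ratio(tp[k], t[k])
        fp = p[k] - tp[k]
        tn = s - t[k] - fp
        pre += w * prec_k
        sen += w * rec_k
        spe += w * ratio(tn, tn + fp)
        f1 += w * ratio(2 * prec_k * rec_k, prec_k + rec_k)

    pt = sum(p[k] * t[k] for k in labels)
    mcc = ratio(
        c * s - pt,
        sqrt((s * s - sum(v * v for v in p.values())) * (s * s - sum(v * v for v in t.values()))),
    )
    pe = pt / (s * s)                                       # chance agreement
    ck = ratio(c / s - pe, 1 - pe)
    return {"ACC": c / s, "F1": f1, "PRE": pre, "SEN": sen, "SPE": spe, "MCC": mcc, "CK": ck}


# Invented toy labels for illustration.
gold = ["A", "A", "B", "B", "C", "C"]
pred = ["A", "B", "B", "B", "C", "A"]
m = weighted_metrics(gold, pred)
print({name: round(value, 4) for name, value in m.items()})
```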

Table 2: Evaluation with seven evaluation metrics on NEJMQA, demonstrating substantial performance improvements with our method in medical decision support scenarios. The best results are highlighted in bold and the second-best results are underlined.
| LLMs | ACC | F1 | PRE | SEN | SPE | MCC | CK |
|---|---|---|---|---|---|---|---|
| phi4 14B | 44.12% | 44.04% | 53.76% | 44.12% | 85.45% | 29.59% | 26.64% |
| qwen2.5 14B | 50.84% | 51.42% | 52.27% | 50.84% | 87.00% | 34.74% | 34.67% |
| qwen2.5 32B | 59.08% | 59.04% | 59.67% | 59.08% | 86.35% | 45.45% | 45.28% |
| qwq 32B | 63.21% | 63.19% | 63.18% | 63.21% | 87.68% | 50.68% | 50.67% |
| openthinker 32B | 64.43% | 64.54% | 65.57% | 64.43% | 88.19% | 52.74% | 52.52% |
| deepseek-r1 32B | 60.46% | 60.53% | 61.11% | 60.46% | 89.43% | 47.25% | 47.11% |
| llama3 instruct 70B | 62.14% | 62.13% | 62.56% | 62.14% | 87.38% | 49.44% | 49.32% |
| Qwen1.5 Chat 72B | 40.46% | 40.63% | 41.97% | 40.46% | 84.21% | 20.99% | 20.79% |
| Qwen1.5 Chat 110B | 53.44% | 53.62% | 54.39% | 53.44% | 87.58% | 37.84% | 37.70% |
| dbrx-instruct 132B | 45.50% | 44.73% | 46.19% | 45.50% | 85.42% | 27.31% | 26.94% |
| Mixtral 8x22 141B | 56.03% | 56.12% | 56.50% | 56.03% | 88.29% | 41.35% | 41.28% |
| WizardLM 8x22 141B | 54.35% | 55.62% | 57.45% | 54.35% | 88.14% | 40.06% | 39.89% |
| GPT-4o-mini (07/18) | 57.25% | 57.13% | 57.40% | 57.25% | 85.68% | 42.81% | 42.72% |
| GPT-4 (06/13) | 66.41% | 66.46% | 66.61% | 66.41% | 91.02% | 55.04% | 55.02% |
| Debate | 67.94% | 67.75% | 69.54% | 67.94% | 89.36% | 57.78% | 57.25% |
| MoA | 54.35% | 54.82% | 55.62% | 54.35% | 87.85% | 39.15% | 39.08% |
| SelfMoA | 39.24% | 40.03% | 43.42% | 39.24% | 83.98% | 20.05% | 19.59% |
| Our | 72.06% | 72.13% | 73.11% | 72.06% | 92.59% | 62.98% | 62.73% |

4.2 Comparison Results

Comparisons on diverse disciplines. To assess whether LLMs demonstrate physician-level medical decision support capacity, we compare all methods on NEJMQA across five clinical disciplines and on MMLU-Pro-health across eight medical specialties. Recall that NEJMQA is derived from Israel’s 2022 medical specialist licensing examination, where physicians are required to achieve a minimum passing score of 65% in each discipline to obtain board certification. As shown in Figure 1, MoA performs worse than several single LLMs in terms of overall ACC, indicating that the collaborativeness of LLMs, which performs well in the general NLP domain, does not carry over to medical decision support scenarios. Even though GPT-4 achieves an overall ACC of 66.41%, it does not reach the official passing score of 65% in the ‘Obstetrics and Gynecology’ and ‘General Surgery’ disciplines, indicating that even the most advanced closed-source models still fall short of professional physicians in medical decision support scenarios. In contrast, our method obtains the best performance on both NEJMQA and MMLU-Pro-health across physician-oriented disciplines, which verifies its effectiveness.

Table 3: Evaluation with seven evaluation metrics on MMLU-Pro-health, demonstrating substantial performance improvements with our method in medical decision support scenarios. The best results are highlighted in bold and the second-best results are underlined.
| LLMs | ACC | F1 | PRE | SEN | SPE | MCC | CK |
|---|---|---|---|---|---|---|---|
| phi4 14B | 70.29% | 70.28% | 70.83% | 70.29% | 96.67% | 66.88% | 66.83% |
| qwen2.5 14B | 62.22% | 62.16% | 62.53% | 62.22% | 95.78% | 57.87% | 57.83% |
| qwen2.5 32B | 67.97% | 68.04% | 68.37% | 67.97% | 96.43% | 64.32% | 64.30% |
| qwq 32B | 66.38% | 66.66% | 67.46% | 66.38% | 96.60% | 62.61% | 62.55% |
| openthinker 32B | 67.73% | 67.93% | 68.65% | 67.73% | 96.73% | 64.09% | 64.04% |
| deepseek-r1 32B | 59.29% | 60.70% | 64.25% | 59.29% | 95.89% | 55.11% | 54.65% |
| llama3 instruct 70B | 67.85% | 67.83% | 68.16% | 67.85% | 96.42% | 64.19% | 64.15% |
| Qwen1.5 Chat 72B | 14.79% | 12.26% | 54.04% | 14.79% | 90.58% | 12.14% | 5.77% |
| Qwen1.5 Chat 110B | 48.29% | 50.21% | 58.75% | 48.29% | 94.25% | 44.12% | 42.43% |
| dbrx-instruct 132B | 41.81% | 43.11% | 49.06% | 41.81% | 93.52% | 36.01% | 35.19% |
| Mixtral 8x22 141B | 55.38% | 55.56% | 57.92% | 55.38% | 95.02% | 50.40% | 50.19% |
| WizardLM 8x22 141B | 50.49% | 52.12% | 59.77% | 50.49% | 94.49% | 46.01% | 44.81% |
| GPT-4o-mini (07/18) | 67.36% | 67.26% | 67.81% | 67.36% | 96.35% | 63.60% | 63.54% |
| GPT-4 (06/13) | 71.76% | 71.74% | 72.12% | 71.76% | 96.84% | 68.52% | 68.48% |
| Debate | 68.83% | 68.81% | 70.31% | 68.83% | 96.50% | 65.33% | 65.15% |
| MoA | 56.97% | 57.80% | 61.24% | 56.97% | 95.19% | 52.21% | 51.93% |
| SelfMoA | 47.19% | 49.27% | 56.00% | 47.19% | 94.12% | 42.25% | 41.20% |
| Our | 75.79% | 75.88% | 76.32% | 75.79% | 97.29% | 73.04% | 72.99% |

Evaluation with multiple metrics. Moreover, we report experimental results with seven evaluation metrics on both NEJMQA and MMLU-Pro-health. From Tables 2 and 3, we make the following observations:

  • Our model, composed of 14B to 32B open-access LLMs, outperforms individual models with 70B to 141B parameters, indicating that optimizing the collaborative architecture can improve the performance of LLMs. The reason for the significant improvement is two-fold. First, our model conducts SD-guided cluster construction to pursue the diversity of LLMs, since we empirically observe that a single LLM with richer output tends to achieve better performance, as also reported in [54]. Second, our model utilizes a CC-guided mask mechanism to ensure consistency between multiple LLMs layer by layer, achieving adaptive collaborativeness of LLMs.

  • Our model outperforms the closed-source models GPT-4 and GPT-4o-mini as well as all other compared methods. For example, on NEJMQA (Table 2), our approach improves over the second-best result on each metric (Debate, except GPT-4 on SPE) by 4.12 percentage points on ACC, 4.38 on F1, 3.57 on PRE, 4.12 on SEN, 1.57 on SPE, 5.20 on MCC, and 5.48 on CK. In addition, on MMLU-Pro-health (Table 3), our approach improves over the second-best method, GPT-4, by 4.03 percentage points on ACC, 4.14 on F1, 4.20 on PRE, 4.03 on SEN, 0.45 on SPE, 4.52 on MCC, and 4.51 on CK.
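The margins over the strongest baseline can be recomputed directly from Tables 2 and 3. The sketch below copies only the rows for Our, Debate, and GPT-4, since in both tables one of these two baselines holds the second-best value on every metric; the helper names `row` and `margins` are introduced here for illustration.

```python
METRICS = ("ACC", "F1", "PRE", "SEN", "SPE", "MCC", "CK")


def row(*values: float) -> dict:
    """Pair one table row with the metric names."""
    return dict(zip(METRICS, values))


TABLE2 = {  # NEJMQA, values in percent
    "Our": row(72.06, 72.13, 73.11, 72.06, 92.59, 62.98, 62.73),
    "Debate": row(67.94, 67.75, 69.54, 67.94, 89.36, 57.78, 57.25),
    "GPT-4": row(66.41, 66.46, 66.61, 66.41, 91.02, 55.04, 55.02),
}
TABLE3 = {  # MMLU-Pro-health, values in percent
    "Our": row(75.79, 75.88, 76.32, 75.79, 97.29, 73.04, 72.99),
    "Debate": row(68.83, 68.81, 70.31, 68.83, 96.50, 65.33, 65.15),
    "GPT-4": row(71.76, 71.74, 72.12, 71.76, 96.84, 68.52, 68.48),
}


def margins(table: dict, ours: str = "Our") -> dict:
    """For each metric: (strongest baseline, margin of `ours` in percentage points)."""
    result = {}
    for metric in METRICS:
        runner_up = max((name for name in table if name != ours), key=lambda name: table[name][metric])
        result[metric] = (runner_up, round(table[ours][metric] - table[runner_up][metric], 2))
    return result


nejmqa_margins = margins(TABLE2)
mmlu_margins = margins(TABLE3)
for metric in METRICS:
    print(metric, nejmqa_margins[metric], mmlu_margins[metric])
```

The differences are absolute percentage points, not relative improvements.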

Figure 4: Comparisons of the ACC (in percentage), occupied memory (in $10^{4}$ MB), and running time (in seconds) of different models on NEJMQA. The diameter of the bubble is proportional to the running time.

Inference Cost Analysis. We analyze the inference costs, i.e., running time and memory occupation, relative to ACC in Figure 4. Most single LLMs show only marginal performance improvements due to limitations in their model size and expressiveness. Although MoA surpasses several single LLMs through iterative aggregation of model outputs, it incurs substantial memory occupation and running time. In contrast, our model achieves 17.71 percentage points higher ACC than MoA on NEJMQA (72.06% vs. 54.35%, Table 2) while reducing memory usage by 70,206 MB and running time by 41,855.52 seconds.
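The memory saving can be reproduced from the MEM column of Table 4, assuming that the baseline row of that table corresponds to the MoA configuration discussed here; the difference of the two entries matches the 70,206 MB reported above. The constant names are introduced for this sketch.

```python
BASELINE_MEM_MB = 179058  # baseline (MoA) row, MEM column of Table 4
OURS_MEM_MB = 108852      # full model row, MEM column of Table 4

saved_mb = BASELINE_MEM_MB - OURS_MEM_MB
relative_saving = saved_mb / BASELINE_MEM_MB  # fraction of the baseline footprint

print(saved_mb, f"{relative_saving:.1%}")
```

In relative terms this corresponds to roughly 39% less occupied memory than the baseline.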

4.3 Ablation Study

We conduct ablation studies to evaluate the effectiveness of the SD and CC strategies and further analyze the influence of different mask mechanisms.

Table 4: The ablation study of the proposed cross-consistency (CC) and the self-diversity (SD) term, where ✗ and ✓ in each row indicate the non-use or use of the corresponding component, respectively. We highlighted the best results with bold, the second-best results with underline.

| Dataset | CC | SD | ACC (%) | F1 (%) | PRE (%) | SEN (%) | SPE (%) | MCC (%) | CK (%) | MEM (MB) |
|---|---|---|---|---|---|---|---|---|---|---|
| NEJMQA | ✗ | ✗ | 54.35 | 54.82 | 55.62 | 54.35 | 87.85 | 39.15 | 39.08 | 179058 |
| NEJMQA | ✗ | ✓ | 65.95 | 66.83 | 68.25 | 65.95 | 91.12 | 55.11 | 54.95 | 108864 |
| NEJMQA | ✓ | ✗ | 62.60 | 62.75 | 63.24 | 62.60 | 90.04 | 50.08 | 49.99 | 165396 |
| NEJMQA | ✓ | ✓ | 72.06 | 72.13 | 73.11 | 72.06 | 92.59 | 62.98 | 62.73 | 108852 |
| MMLU-Pro-health | ✗ | ✗ | 56.97 | 57.80 | 61.24 | 56.97 | 95.19 | 52.21 | 51.93 | 179058 |
| MMLU-Pro-health | ✗ | ✓ | 47.07 | 48.19 | 52.19 | 47.07 | 94.63 | 41.25 | 40.99 | 108864 |
| MMLU-Pro-health | ✓ | ✗ | 62.10 | 62.20 | 63.39 | 62.10 | 95.78 | 57.83 | 57.72 | 165396 |
| MMLU-Pro-health | ✓ | ✓ | 75.79 | 75.88 | 76.32 | 75.79 | 97.29 | 73.04 | 72.99 | 108852 |

Analysis of SD and CC strategies. We conduct comprehensive ablation studies with seven evaluation metrics to better understand the proposed SD and CC strategies. The experimental results are listed in Table 4, where the first row denotes the baseline [11]. The second and third rows denote the variants of the baseline that exploit only the SD strategy and only the CC strategy, respectively. The fourth row is our whole model, i.e., Our. Comparing the second and third rows with Our on each metric validates the advantage of combining the SD and CC strategies. For example, on NEJMQA, simultaneously considering the SD and CC strategies yields an ACC improvement of 6.11 to 9.46 percentage points over using either strategy alone.

Table 5: The ablation study of the employed mask strategies. ‘baseline’ indicates the layers without the mask mechanism, i.e., all the LLMs participate in the aggregation. ‘random’ indicates the random mask mechanism. ‘sequence’ indicates the mask mechanism in ascending order according to the individual SD values of LLMs, i.e., mask out the LLM with the smallest SD value layer by layer. ‘Our’ indicates the proposed mask mechanism using the CC maximization mechanism. The best results are highlighted in bold, the second-best results with underline.
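| Dataset | Mask mechanism | ACC (%) | F1 (%) | PRE (%) | SEN (%) | SPE (%) | MCC (%) | CK (%) |
|---|---|---|---|---|---|---|---|---|
| NEJMQA | baseline | 54.35 | 54.82 | 55.62 | 54.35 | 87.85 | 39.15 | 39.08 |
| NEJMQA | random | 61.22 | 61.33 | 61.53 | 61.22 | 89.66 | 48.17 | 48.14 |
| NEJMQA | sequence | 66.11 | 66.22 | 66.98 | 66.11 | 91.00 | 54.93 | 54.76 |
| NEJMQA | Our | 72.06 | 72.13 | 73.11 | 72.06 | 92.59 | 62.98 | 62.73 |
| MMLU-Pro-health | baseline | 56.97 | 57.80 | 61.24 | 56.97 | 95.19 | 52.21 | 51.93 |
| MMLU-Pro-health | random | 61.61 | 62.28 | 64.94 | 61.61 | 95.72 | 57.46 | 57.19 |
| MMLU-Pro-health | sequence | 69.07 | 69.44 | 70.89 | 69.07 | 96.55 | 65.62 | 65.49 |
| MMLU-Pro-health | Our | 75.79 | 75.88 | 76.32 | 75.79 | 97.29 | 73.04 | 72.99 |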

Analysis of the mask mechanism. To evaluate the advantage of our CC-driven adaptive mask mechanism, we compare different mask mechanisms in Table 5, where ‘random’ masks out a randomly chosen LLM layer by layer, ‘sequence’ masks out the LLM with the smallest individual SD value layer by layer, i.e., in ascending order of SD, and ‘Our’ is the proposed mask mechanism based on CC maximization. Table 5 shows that using CC maximization to adaptively mask the low-consistency LLM in each layer improves performance in medical decision support scenarios. On NEJMQA, for instance, ‘Our’ reaches an ACC of 72.06, compared with 66.11 for ‘sequence’ and 61.22 for ‘random’.
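The three mask mechanisms differ only in how the LLM to be removed is chosen in each layer, which the sketch below makes explicit. The SD and CC values are invented, the function names are hypothetical, and whether ‘random’ and ‘sequence’ may also remove the reference LLM is not specified in the text; here they may.

```python
import random
from typing import Dict, List


def mask_random(active: List[str], rng: random.Random) -> str:
    """'random': remove an arbitrary LLM from the active cluster."""
    return rng.choice(active)


def mask_sequence(active: List[str], sd: Dict[str, float]) -> str:
    """'sequence': remove the LLM with the smallest self-diversity value."""
    return min(active, key=lambda k: sd[k])


def mask_cc(active: List[str], reference: str, cc_to_ref: Dict[str, float]) -> str:
    """'Our': remove the LLM whose output is least consistent with the reference LLM."""
    return min((k for k in active if k != reference), key=lambda k: cc_to_ref[k])


active = ["I", "II", "III", "IV"]
sd = {"I": 0.9, "II": 0.4, "III": 0.7, "IV": 0.6}   # hypothetical SD values
cc_to_ref = {"II": 0.8, "III": 0.3, "IV": 0.5}       # hypothetical CC with LLM I

choice_random = mask_random(active, random.Random(0))
choice_sequence = mask_sequence(active, sd)
choice_cc = mask_cc(active, "I", cc_to_ref)
print(choice_random, choice_sequence, choice_cc)
```

With these invented values, ‘sequence’ removes LLM II because of its low SD, whereas the CC-based rule removes LLM III because it agrees least with the reference.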

5 Conclusion

We propose an adaptive cluster collaborativeness methodology that incorporates self-diversity and cross-consistency maximization mechanisms to achieve the adaptive collaborativeness of LLMs. For self-diversity, we first calculate the fuzzy matching value between pairwise outputs within an LLM as its self-diversity value, then prioritize LLMs with high self-diversity values as cluster components in a self-supervised manner. For cross-consistency, we measure the cross-consistency between pairwise outputs of the LLM with the highest self-diversity and the others to gradually mask out the LLMs with the lowest cross-consistency values. Extensive experiments on NEJMQA and MMLU-Pro-health demonstrate the effectiveness of our model in medical decision support scenarios across physician-oriented specialties, making frameworks that leverage the collaborativeness of LLMs more efficient and affordable.

Limitations. Current research on the collaborativeness of LLMs has primarily focused on text-based modalities. However, healthcare frequently involves multimodal data, particularly the integration of imaging with textual information. Investigating the collaborativeness of visual LLMs (VLLMs) represents a promising yet underexplored direction.

Broader Impact. In many regions around the world, 24-hour access to physicians remains limited. As AI models approach physician-level performance on medical question-answering tasks, they show significant promise in supporting healthcare professionals. Our method demonstrates a performance advantage in question-answering, suggesting that our work could meaningfully advance such applications. Importantly, this technology is designed to complement rather than replace physicians, especially in resource-constrained settings where specialists are in short supply.

References

  • [1] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [3] Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei A Zaharia, and James Y Zou. Are more llm calls all you need? towards the scaling properties of compound ai systems. Advances in Neural Information Processing Systems, 37:45767–45790, 2024.
  • [4] Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, 1(8):AIoa2400196, 2024.
  • [5] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative ai by backpropagating language model feedback. Nature, 639(8055):609–616, 2025.
  • [6] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023.
  • [7] Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, Wei Jin, Joyce Ho, and Carl Yang. Knowledge-infused prompting improves clinical text generation with large language models. In NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI, 2023.
  • [8] Stella Li, Vidhisha Balachandran, Shangbin Feng, Jonathan Ilgen, Emma Pierson, Pang Wei W Koh, and Yulia Tsvetkov. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning. Advances in Neural Information Processing Systems, 37:28858–28888, 2024.
  • [9] Uriel Katz, Eran Cohen, Eliya Shachar, Jonathan Somer, Adam Fink, Eli Morse, Beki Shreiber, and Ido Wolf. Gpt versus resident physicians—a benchmark based on official board scores. NEJM AI, 1(5):AIdbp2300192, 2024.
  • [10] Zhen Chen, Zhihao Peng, Xusheng Liang, Cheng Wang, Peigan Liang, Linsheng Zeng, Minjie Ju, and Yixuan Yuan. Map: Evaluation and multi-agent enhancement of large language models for inpatient pathways. arXiv preprint arXiv:2503.13205, 2025.
  • [11] Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. In The Thirteenth International Conference on Learning Representations, 2025.
  • [12] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023.
  • [13] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17889–17904, 2024.
  • [14] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations, 2024.
  • [15] Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, and Shumin Deng. Exploring collaboration mechanisms for LLM agents: A social psychology view. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14544–14607, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
  • [16] Andrew Estornell and Yang Liu. Multi-llm debate: Framework, principals, and interventions. Advances in Neural Information Processing Systems, 37:28938–28964, 2024.
  • [17] Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. Don’t hallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. In ACL (1), 2024.
  • [18] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2023.
  • [19] Wenzhe Li, Yong Lin, Mengzhou Xia, and Chi Jin. Rethinking mixture-of-agents: Is mixing different large language models beneficial? arXiv preprint arXiv:2502.00674, 2025.
  • [20] Emmett Keeler, Mark D Perkins, Peter Small, Christy Hanson, Steven Reed, Jane Cunningham, Julia E Aledort, Lee Hillborne, Maria E Rafael, Federico Girosi, et al. Reducing the global burden of tuberculosis: the contribution of improved diagnostics. Nature, 444(Suppl 1):49–57, 2006.
  • [21] Yiqiu Shen, Farah E Shamout, Jamie R Oliver, Jan Witowski, Kawshik Kannan, Jungkyu Park, Nan Wu, Connor Huddleston, Stacey Wolfson, Alexandra Millet, et al. Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams. Nature Communications, 12(1):5645, 2021.
  • [22] Krishnamurthy Dvijotham, Jim Winkens, Melih Barsbey, Sumedh Ghaisas, Robert Stanforth, Nick Pawlowski, Patricia Strachan, Zahra Ahmed, Shekoofeh Azizi, Yoram Bachrach, et al. Enhancing the reliability and accuracy of ai-enabled diagnosis via complementarity-driven deferral to clinicians. Nature Medicine, 29(7):1814–1820, 2023.
  • [23] Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik Siu Chan, Xuhai Xu, Daniel McDuff, Hyeonhoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, Hae Park, et al. Mdagents: An adaptive collaboration of llms for medical decision-making. Advances in Neural Information Processing Systems, 37:79410–79452, 2024.
  • [24] Bingbing Wen, Chenjun Xu, Robert Wolfe, Lucy Lu Wang, Bill Howe, et al. Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration. In NeurIPS 2024 Workshop on Behavioral Machine Learning, 2024.
  • [25] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • [26] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
  • [27] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023.
  • [28] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024.
  • [29] Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of-thought: Large language models can do parallel decoding. Proceedings ENLSP-III, 2023.
  • [30] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.
  • [31] Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong, and Yangqiu Song. Rethinking the bounds of llm reasoning: Are multi-agent discussions the key? In 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024, pages 6106–6131. Association for Computational Linguistics (ACL), 2024.
  • [32] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
  • [33] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023.
  • [34] Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007, 2023.
  • [35] Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142, 2023.
  • [36] Max Bachmann. rapidfuzz/rapidfuzz: Release 3.8.1, April 2024.
  • [37] Vladimir I Levenshtein et al. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710. Soviet Union, 1966.
  • [38] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
  • [39] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.
  • [40] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
  • [41] Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025.
  • [42] OpenThoughts Team. Open thoughts, February 2025.
  • [43] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  • [44] AI Meta. Introducing meta llama 3: The most capable openly available llm to date. Meta AI, 2(5):6, 2024.
  • [45] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  • [46] The Mosaic Research Team. Introducing dbrx: A new state-of-the-art open llm, 2024.
  • [47] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  • [48] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  • [49] David MW Powers. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061, 2020.
  • [50] Jacob Yerushalmy. Statistical problems in assessing methods of medical diagnosis, with special reference to x-ray techniques. Public Health Reports (1896-1970), pages 1432–1449, 1947.
  • [51] AJ Saah and DR Hoover. Sensitivity and specificity revisited: significance of the terms in analytic and diagnostic language. In Annales de Dermatologie et de Venereologie, volume 125, pages 291–294, 1998.
  • [52] Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451, 1975.
  • [53] Mary L McHugh. Interrater reliability: the kappa statistic. Biochemia medica, 22(3):276–282, 2012.
  • [54] Selim Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, and Ling Liu. Llm-topla: Efficient llm ensemble by maximising diversity. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11951–11966, 2024.