TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Type prediction	ManyTypes4TypeScript	GraphPolyGot	Average Accuracy	61.00	# 6
Type prediction	ManyTypes4TypeScript	GraphPolyGot	Average Precision	58.36	# 4
Type prediction	ManyTypes4TypeScript	GraphPolyGot	Average Recall	58.91	# 3
Type prediction	ManyTypes4TypeScript	GraphPolyGot	Average F1	58.63	# 4
Type prediction	ManyTypes4TypeScript	PolyGot	Average Accuracy	61.29	# 5
Type prediction	ManyTypes4TypeScript	PolyGot	Average Precision	58.81	# 3
Type prediction	ManyTypes4TypeScript	PolyGot	Average Recall	58.91	# 3
Type prediction	ManyTypes4TypeScript	PolyGot	Average F1	58.86	# 3
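
The Average F1 values above appear consistent with the harmonic mean of the listed Average Precision and Average Recall. The short check below illustrates this. It assumes F1 was derived from the averaged precision and recall, which the page does not state, and `f1_from_precision_recall` is a hypothetical helper name used only for illustration.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table, in percent.
reported = {
    "GraphPolyGot": (58.36, 58.91, 58.63),
    "PolyGot": (58.81, 58.91, 58.86),
}

for model, (p, r, f1) in reported.items():
    computed = f1_from_precision_recall(p, r)
    # Assumption: F1 is the harmonic mean of the averaged P and R.
    print(f"{model}: computed F1 = {computed:.2f}, reported F1 = {f1:.2f}")
```

If the benchmark instead averages per-class F1 scores, the two numbers need not agree exactly, so treat this only as a sanity check on the table.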

Multilingual training for Software Engineering

3 Dec 2021 · Toufique Ahmed, Premkumar Devanbu

Well-trained machine-learning models, which leverage large amounts of open-source software data, have become an interesting approach to automating many software engineering (SE) tasks. Several SE tasks have been subject to this approach, with performance gradually improving over the past several years thanks to better models and training methods. More, and more diverse, clean, labeled data is better for training, but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby), labeled data is less abundant; in others (e.g., JavaScript), the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages that performs the same function is rather similar, and in particular preserves identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance. We study this for 3 different tasks: code summarization, code retrieval, and function naming. We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.

Tasks

Code Summarization

Retrieval

Type prediction

Datasets

CodeSearchNet

CodeXGLUE

ManyTypes4TypeScript

Results from the Paper

Ranked #5 on Type prediction on ManyTypes4TypeScript
