Boitumelo Nkwe and Michael Kyobe, Department of Information Systems, University of Cape Town, Cape Town, South Africa
The increasing adoption of the Internet of Things (IoT) has introduced unique challenges for both users and the cybersecurity domain. As IoT has evolved, the cybersecurity threats and vulnerabilities directed against IoT devices have also increased. IoT devices are susceptible to breaches; therefore, forensic investigations focusing on IoT technologies need to be improved. This study aims to provide an understanding of the challenges in IoT forensic investigation reported since 2017. Furthermore, the article examines the solutions, in the form of frameworks and methodologies, that have been developed to address these challenges, and identifies the gaps in the existing literature. The researchers adopted a systematic review methodology to guide the synthesis of the literature. The key issues highlighted in this study include the heterogeneous nature of IoT, the lack of proper investigative tools and frameworks that encompass all levels of IoT forensics, the lack of privacy, and the lack of standardization in the investigation process.
Internet of Things (IoT), IoT forensics, Cybersecurity, Challenges
Xiaoqin HU, Beijing Language and Culture University, China
This research aims to explore a deeper representation of the internal structure and semantic relations of multiword nouns (MWNs) in order to improve MWN discovery. This representation focuses on MWN formations, which follow a series of categorical and semantic constraints. The internal semantic relations of MWNs are represented by semantic class combinations of constituents, and the internal structures are represented by a set of categorical combinations in a hierarchy. These linguistically motivated semantic features are combined with statistically motivated semantic features, and the results show an improvement in MWN discovery.
Multiword nouns, automatic discovery, internal structure, internal semantic relation, semantic class combination, linguistic knowledge
Yanan Jia, Businessolver, USA
As human-machine voice interfaces provide easy access to increasingly intelligent machines, many state-of-the-art automatic speech recognition (ASR) systems have been proposed. However, commercial ASR systems usually perform poorly on domain-specific speech, especially under low-resource settings. The author works with pre-trained DeepSpeech2 and Wav2Vec2 acoustic models to develop benefit-specific ASR systems. The domain-specific data are collected using a proposed semi-supervised learning annotation method that requires little human intervention. The best performance comes from a fine-tuned Wav2Vec2-Large-LV60 acoustic model with an external KenLM, which surpasses the Google and AWS ASR systems on benefit-specific speech. The viability of using error-prone ASR transcriptions as part of spoken language understanding (SLU) is also investigated. Results of a benefit-specific natural language understanding (NLU) task show that the domain-specific fine-tuned ASR system can outperform the commercial ASR systems even when its transcriptions have a higher word error rate (WER), and that the results obtained from fine-tuned ASR transcriptions and human transcriptions are similar.
Automatic Speech Recognition, DeepSpeech2, Wav2Vec2, Semi-supervised learning annotation, Spoken language understanding
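The word error rate (WER) referred to in the abstract above is conventionally defined as the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition only; the function name and the example utterances are illustrative assumptions and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is plain
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up utterances: one deleted word plus one substituted word.
    print(word_error_rate("what is my dental deductible",
                          "what is dental deductable"))
```

Because WER counts every word error equally, a transcript can have a higher WER and still preserve the words that matter to a downstream understanding task, which is consistent with the comparison the abstract reports.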
HongLi Deng1, XinZhong Liu2, XianMing Bei3, 1School of Liberal Arts, Jinan University, Guangzhou, Guangdong, China and College of Culture and Communication, Guangxi Science and Technology Normal University, Laibin, Guangxi, China, 2School of Liberal Arts, Jinan University, Guangzhou, Guangdong, China, 3School of Chinese Language and Culture, Guangdong University of Foreign Studies, Guangzhou, Guangdong, China
Based on the theory of second language acquisition, the article analyzes the pronunciation of “er”, which means “two” in Chinese, by learners from different native language backgrounds, and explores the key acoustic characteristics of, and the factors influencing, the acquisition of retroflex vowels in S&P. The study has four findings: (1) In S&P, F2 of retroflex vowels rises and F3 falls, and the difference between the F3 endpoint and the F2 endpoint is small, which means that F3 and F2 move closer to each other. The key characteristics of a learner’s pronunciation of “two” are the slope of F3 and the value of F3: the more steeply F3 falls and the smaller its value, the closer the learner’s pronunciation of “two” is to S&P. (2) Retroflex vowels are highly marked phonemes, which makes them difficult to acquire. (3) Factors such as the acquisition environment and the length of second language acquisition time influence the acquisition of retroflex vowels. (4) An early learning environment promotes the acquisition of retroflex vowels in Putonghua.
retroflex vowels; slope of F3; acoustic characteristics; influencing factors; acquisition theory
Raul Salles de Padua, Imran Qureshi and Mustafa U. Karakaplan, Stanford University, University of Texas at Austin, University of South Carolina
Financial analysis is an important tool for evaluating company performance. Practitioners work to answer financial questions to make profitable investment decisions, and use advanced quantitative analyses to do so. As a result, Financial Question Answering (QA) is a question answering task that requires deep reasoning about numbers. Furthermore, it is unknown how well pre-trained language models can reason in the financial domain. The current state of the art requires a retriever to collect relevant facts about the financial question from the text and a generator to produce a valid financial program and a final answer. However, large language models like GPT-3 [3] have recently achieved state-of-the-art performance on a wide variety of tasks with just a few few-shot examples. We run several experiments with GPT-3 and find that a separate retrieval model and logic engine continue to be essential components for achieving SOTA performance in this task, particularly due to the precise nature of financial questions and the complex information stored in financial documents. With this understanding, our refined prompt engineering approach on GPT-3 achieves near-SOTA accuracy without any fine-tuning.
Question Answering, GPT-3, Financial Question Answering, Large Language Models, Information Retrieval, BERT, RoBERTa, F
Wu Zhang, Miotech, 69 Jervois St, Sheung Wan, Hong Kong
Duplicated training data usually degrades the performance of machine learning models. This paper presents a practical algorithm for efficiently deduplicating highly similar news articles in large datasets. Our algorithm comprises three components (document embedding, similarity computation, and clustering), each utilizing specific algorithms and tools to optimize both speed and performance. We demonstrate the efficacy of our approach by accurately deduplicating over 7 million news articles in less than 4 hours.
News deduplication, natural language processing
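The three stages named in the abstract above (document embedding, similarity computation, clustering) can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not the paper's algorithm: the abstract does not name its embedding model, similarity measure, or clustering method, so a hashed bag-of-words vector, cosine similarity, and union-find grouping are swapped in here. All function names, the vector size, and the 0.9 threshold are hypothetical.

```python
import hashlib
import math
from collections import defaultdict


def embed(text: str, dim: int = 1024) -> list[float]:
    """Stage 1 (assumed): hashed bag-of-words vector, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Stage 2 (assumed): cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cluster_duplicates(docs: list[str], threshold: float = 0.9) -> list[list[int]]:
    """Stage 3 (assumed): union-find over all pairs above the threshold."""
    vectors = [embed(doc) for doc in docs]
    parent = list(range(len(docs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Exhaustive pairwise comparison: fine for a demo, quadratic in general.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i in range(len(docs)):
        groups[find(i)].append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    articles = [
        "Central bank raises interest rates by 25 basis points",
        "Central bank raises interest rates by 25 basis points today",
        "Local team wins the championship final",
    ]
    print(cluster_duplicates(articles))
```

The exhaustive pairwise loop would not scale to millions of articles; a system at the scale the abstract reports would presumably replace it with some form of indexed or approximate nearest-neighbor search, a detail the abstract does not specify.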
Asrul Sani Ariesandy1, Mukhlis Amien2, Alham Fikri Aji3, Radityo Eko Prasojo4, 1Sekolah Tinggi Informatika & Komputer Indonesia (STIKI), Malang, Indonesia, 2Kata.ai Research Team, Jakarta, Indonesia, 3Beijing Institute of Technology, China, 4Faculty of Computer Science, Universitas Indonesia
Neural Machine Translation (NMT) works better in Indonesian when it takes into account local dialects, geographical context, and regional culture (colloquialism). NMT is typically domain-dependent and style-dependent, and it requires large amounts of training data. State-of-the-art NMT models often fall short in handling colloquial variations of their source language, and the lack of parallel data in this regard is a challenging hurdle to systematically improving the existing models, despite the fact that Indonesians frequently employ colloquial language. In this work, we develop a colloquial Indonesian-English test set collected from YouTube transcripts and Twitter. We perform synthetic style augmentation on the formal Indonesian source text and show that it improves the baseline Id-En models (in BLEU) on the new test data.
Neural Machine Translation, NMT, Natural Language Processing, NLP, Low-Resource Language, Indonesian, Artificial Intelligence
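One simple way to picture synthetic style augmentation of formal source text is lexicon-based substitution, in which formal words are swapped for colloquial counterparts before the sentence is paired with its unchanged English translation. The abstract does not describe the authors' augmentation procedure, so the sketch below is an assumption for illustration only: the four-entry lexicon, the function name, and the example sentence are hypothetical.

```python
import re

# Hypothetical formal-to-colloquial lexicon (illustrative, not from the paper).
COLLOQUIAL = {
    "tidak": "nggak",
    "sudah": "udah",
    "saya": "aku",
    "bagaimana": "gimana",
}


def colloquialize(sentence: str, lexicon: dict[str, str] = COLLOQUIAL) -> str:
    """Replace formal words with colloquial variants, preserving capitalization."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = lexicon.get(word.lower())
        if replacement is None:
            return word  # leave words outside the lexicon untouched
        return replacement.capitalize() if word[0].isupper() else replacement

    # Match runs of ASCII letters so punctuation and spacing are kept as-is.
    return re.sub(r"[A-Za-z]+", swap, sentence)


if __name__ == "__main__":
    formal = "Saya tidak tahu bagaimana caranya."
    print(colloquialize(formal))  # Aku nggak tahu gimana caranya.
```

Each rewritten source sentence keeps its original English target, so an existing formal parallel corpus yields additional colloquial-style training pairs without new human translation.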