胖胖禿: Regular Expressions - regex - Extracting Chinese Characters

To pull the Chinese characters out of a string, you can rely on the regex engine's support for Unicode blocks.


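import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChineseExtractor {
    // Matches one character from the CJK Unified Ideographs block.
    private static final Pattern CHINESE = Pattern.compile("\\p{InCJKUnifiedIdeographs}");

    public static void main(String[] args) {
        String oriStr = "我是小胖123444#$%^*(";
        String afterStr = getChinese(oriStr);
        System.out.println(afterStr); // prints "我是小胖"
    }

    // Returns only the Chinese characters of the input, in their original order.
    public static String getChinese(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        Matcher matcher = CHINESE.matcher(in);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            out.append(matcher.group());
        }
        return out.toString();
    }
}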
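
As a side note, the same filtering can be sketched without an explicit Matcher loop by deleting everything that is not in the block. This is only an illustrative variant: the class name ChineseFilter and the method keepChinese are made up for this example, and it relies on \P (capital P), which is the negated form of \p in java.util.regex.

```java
public class ChineseFilter {
    // Hypothetical helper: deletes every character that is NOT in the
    // CJK Unified Ideographs block (\P is the negation of \p).
    static String keepChinese(String in) {
        if (in == null) {
            return "";
        }
        return in.replaceAll("\\P{InCJKUnifiedIdeographs}", "");
    }

    public static void main(String[] args) {
        System.out.println(keepChinese("我是小胖123444#$%^*(")); // prints "我是小胖"
    }
}
```

The loop version is easier to extend (for example, to collect match positions), while this one is shorter when all you need is the filtered string.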

The pattern "\\p{InCJKUnifiedIdeographs}" needs some explanation.
Unicode groups its code points into named blocks. The block lists can be found in the following files:

Block list for Unicode 3.2:
http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt

Block list for Unicode 4.1.0:
http://www.unicode.org/Public/4.1.0/ucd/Blocks.txt

Block list for Unicode 5.0:
http://www.unicode.org/Public/5.0.0/ucd/Blocks.txt

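Each file lists the Unicode block names and the code point range that belongs to each one.
The "CJK Unified Ideographs" block (4E00..9FFF) is the one that holds Chinese characters. As the name suggests, it is not limited to Traditional Chinese: it contains the Han ideographs shared by Chinese (Traditional and Simplified), Japanese, and Korean, but not kana or hangul.
In a regex, "\p" selects characters by Unicode property, and in Java a block is named by writing "In" followed by the block name without spaces (inside a Java string literal the backslash is doubled, hence "\\p").
Naming the block this way matches every character in the corresponding range, which is what the Java code above does.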
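
To make the block idea concrete, here is a minimal sketch that checks block membership in two ways: through Character.UnicodeBlock.of and through the regex itself. The class name BlockCheck, the helper isCjkUnified, and the sample string are assumptions made up for this illustration.

```java
public class BlockCheck {
    // Hypothetical helper: true if the code point lies in the
    // CJK Unified Ideographs block.
    static boolean isCjkUnified(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static void main(String[] args) {
        // Two Han ideographs, one hiragana, one katakana, one hangul, one Latin letter.
        String sample = "漢字かカ한a";
        for (int i = 0; i < sample.length(); ) {
            int cp = sample.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            boolean byBlock = isCjkUnified(cp);
            boolean byRegex = ch.matches("\\p{InCJKUnifiedIdeographs}");
            System.out.printf("%s U+%04X block=%b regex=%b%n", ch, cp, byBlock, byRegex);
            i += Character.charCount(cp);
        }
    }
}
```

Only 漢 and 字 report true here; the kana, the hangul syllable, and the Latin letter live in other blocks, so a pattern built on this block alone will skip them.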

Related article: http://developers.sun.com/dev/gadc/unicode/perl/perl561.html

胖胖禿

Tuesday, August 4, 2009
