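When scraping websites with Python, you will find that some pages are bilingual, with both Chinese and English text, when you only need the Chinese. How should the regular expression be written? The answer first: the character class [\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b] matches the common Chinese punctuation marks (。;,:“”()、?《》), and adding the range \u4e00-\u9fa5 to the same class also matches the Chinese characters themselves.
Run the example below to see how it works.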
import re
# Sample text: an English paragraph followed by the same paragraph in Chinese
s = '''I am ten years old now, I am studying at a primary school, and I am in grade four. There are many subjects for me to learn, among them, I like Chinese the most. Chinese is our country's language, it has more than five thousand years of history. I am so interested in Chinese culture, and learning Chinese well can help me understand Chinese culture better.
我现在十岁了,我在一所小学上学,我现在读四年级。我要学很多的科目,在这些科目当中,我最喜欢语文。汉语是我们国家的语言,有超过五千年的历史。我对中国的历史很感兴趣,学好语文能让我更好的了解中国历史。'''
# Each match is one Chinese punctuation mark or one character in \u4e00-\u9fa5
t = re.findall(r'[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]', s)
print(''.join(t))
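Running it prints only the Chinese characters and Chinese punctuation from s; the English sentences are dropped.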
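If you need this in more than one place, the same character classes can be wrapped in a small helper. The following is only a sketch: the names CJK_CHARS, CJK_PUNCT and extract_chinese, as well as the keep_punctuation switch, are made up for illustration and are not part of the original example. It also adds + after the class so that each match is a whole run of Chinese text rather than a single character.

```python
import re

# Hypothetical names; the character classes are the ones from the example above.
CJK_CHARS = r'\u4e00-\u9fa5'
CJK_PUNCT = r'\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b'

# Compile once so the patterns can be reused across many pages
HANZI_ONLY = re.compile('[' + CJK_CHARS + ']+')
HANZI_AND_PUNCT = re.compile('[' + CJK_CHARS + CJK_PUNCT + ']+')


def extract_chinese(text, keep_punctuation=True):
    """Return the Chinese parts of text joined into one string."""
    pattern = HANZI_AND_PUNCT if keep_punctuation else HANZI_ONLY
    # The trailing + makes each match a contiguous run of Chinese text
    return ''.join(pattern.findall(text))


sample = 'I like Chinese. 我喜欢语文,也喜欢历史。'
print(extract_chinese(sample))                          # characters and punctuation
print(extract_chinese(sample, keep_punctuation=False))  # characters only
```

Note that \u4e00-\u9fa5 is the range used in this article; characters outside it, such as rare ideographs in the extension blocks, are not matched.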