Announcement
Data Warehousing & Data Mining
Seeing something through to the end takes persistence.
薛峰 (Xue Feng)
2009-02-03
[Data Warehousing] IBM DB2 Study Notes (online resources)
Posted by 薛峰 on 2005/6/30 8:53:42
[IBM DB2 Study Notes, Part 1]
[by 彭建军 (Peng Jianjun)]
Note: syntax and concepts that are the same in IBM DB2 as in MS SQL Server 2000 are not repeated here.
I. [DB2 SQL Overview]
1. [Schemas]
1.1 A schema is a named collection of objects such as tables and views. Schemas provide a logical classification of the objects in a database.
1.2 A schema is created implicitly whenever an object is created in the database; a schema can also be created explicitly with CREATE SCHEMA.
1.3 When naming objects, note that an object name has two parts, schema.object-name, e.g. pjj.TempTable1. If the schema is not specified explicitly, the default schema (the ID of the current user) is used.
2. [Data Types]
Fixed-length character string: CHAR(x), with x ranging from 1 to 254; a sequence of bytes.
Variable-length character string: VARCHAR(x)
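A minimal sketch of the two points above. The schema name pjj and the table name pjj.TempTable1 come from the notes themselves; the column definitions are hypothetical, added only to show CHAR/VARCHAR and schema-qualified names in use.

-- Create the schema explicitly (it would otherwise be created implicitly
-- the first time an object is created under it).
CREATE SCHEMA pjj;

-- Two-part name: schema.object. CHAR(x) is fixed length (1 <= x <= 254),
-- VARCHAR(x) is variable length.
CREATE TABLE pjj.TempTable1 (
    id      INTEGER     NOT NULL,
    code    CHAR(10),
    remarks VARCHAR(200)
);

-- Without an explicit schema, DB2 resolves the name against the default
-- schema (by default, the authorization ID of the current user).
SELECT id, code FROM TempTable1;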
[Data Mining] Applying Data Mining to Telecom Fraud Detection (essay, reflections)
Posted by 薛峰 on 2005/6/28 10:03:29
Abstract: Fraud is a conspicuous problem in the telecom sector. This article studies how data mining can be applied to telecom fraud detection and validates the approach on real data from a mobile operator. Detection is carried out through the steps of business understanding, data understanding, data preparation, model building, and model deployment. The model-building stage uses the Kohonen neural network, a self-organizing clustering algorithm.
Keywords: data mining; fraud detection; Kohonen algorithm; CRISP-DM
1 Introduction
With the rapid growth of mobile services, the mobile communications industry's revenue keeps rising, but fraud on mobile networks has emerged along with it. The industry worldwide faces a serious wireless fraud problem: operators lose revenue and incur extra expense, so profits fall, while subscribers' legitimate interests are harmed and operators' reputations suffer.
Wireless fraud can be roughly divided into four categories: (1) airtime fraud, i.e. using mobile airtime without paying for it, which itself splits into technical fraud (cloned handsets, "magic phones", and the like) and subscriber fraud (roaming abuse, abuse of supplementary services, and good-faith fraud); (2) internal fraud, where operator staff abuse their positions for illegal gain; (3) handset fraud, …
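The abstract only names the clustering technique; as a purely illustrative sketch (not the operator data or the model from the article), the fragment below clusters a made-up feature matrix with a Kohonen self-organizing map, assuming MATLAB's selforgmap (Deep Learning Toolbox) is available.

% Illustrative only: hypothetical usage features, 4 features x 520 records
X = [rand(4,500), 5*rand(4,20)];   % last 20 columns are deliberately extreme
net = selforgmap([5 5]);           % 5x5 Kohonen map, i.e. 25 clusters
net = train(net, X);               % unsupervised (self-organizing) training
cluster_id = vec2ind(net(X));      % winning neuron (cluster index) per record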
MATLAB Implementations of Data Mining Algorithms: ID3 (essay, reading notes)
Posted by 薛峰 on 2005/6/27 14:22:13
function D = ID3(train_features, train_targets, params, region)
% Classify using Quinlan's ID3 algorithm
% Inputs:
%   train_features - Train features
%   train_targets  - Train targets
%   params         - [Number of bins for the data, Percentage of incorrectly assigned samples at a node]
%   region         - Decision region vector: [-x x -y y number_of_points]
% Outputs:
%   D              - Decision surface

[Ni, M] = size(train_features);

% Get parameters
[Nbins, inc_node] = process_params(params);
inc_node = inc_node*M/100;

% For the decision region
N      = region(5);
mx     = ones(N,1) * linspace(region(1), region(2), N);
my     = linspace(region(3), region(4), N)' * ones(1,N);
flatxy = [mx(:), my(:)]';

% Preprocessing
[f, t, UW, m]  = PCA(train_features, train_targets, Ni, region);
train_features = UW * (train_features - m*ones(1,M));
flatxy         = UW * (flatxy - m*ones(1,N^2));

% First, bin the data and the decision region data
[H, binned_features] = high_histogram(train_features, Nbins, region);
[H, binned_xy]       = high_histogram(flatxy, Nbins, region);

% Build the tree recursively
disp('Building tree')
tree = make_tree(binned_features, train_targets, inc_node, Nbins);

% Make the decision region according to the tree
disp('Building decision surface using the tree')
targets = use_tree(binned_xy, 1:N^2, tree, Nbins, unique(train_targets));

D = reshape(targets, N, N);
% END
function targets = use_tree(features, indices, tree, Nbins, Uc)
% Classify recursively using a tree

targets = zeros(1, size(features,2));

if (size(features,1) == 1),
    % Only one dimension left, so work on it
    for i = 1:Nbins,
        in = indices(find(features(indices) == i));
        if ~isempty(in),
            if isfinite(tree.child(i)),
                targets(in) = tree.child(i);
            else
                % No data was found in the training set for this bin, so choose it randomly
                n = 1 + floor(rand(1)*length(Uc));
                targets(in) = Uc(n);
            end
        end
    end
    return
end

% This is not the last level of the tree, so:
% First, find the dimension we are to work on
dim  = tree.split_dim;
dims = find(~ismember(1:size(features,1), dim));

% And classify according to it
for i = 1:Nbins,
    in = indices(find(features(dim, indices) == i));
    targets = targets + use_tree(features(dims, :), in, tree.child(i), Nbins, Uc);
end
% END use_tree
function tree = make_tree(features, targets, inc_node, Nbins)
% Build a tree recursively

[Ni, L] = size(features);
Uc = unique(targets);

% When to stop: if the dimension is one or the number of examples is small
if ((Ni == 1) | (inc_node > L)),
    % Compute the children non-recursively
    for i = 1:Nbins,
        tree.split_dim = 0;
        indices = find(features == i);
        if ~isempty(indices),
            if (length(unique(targets(indices))) == 1),
                tree.child(i) = targets(indices(1));
            else
                H = hist(targets(indices), Uc);
                [m, T] = max(H);
                tree.child(i) = Uc(T);
            end
        else
            tree.child(i) = inf;
        end
    end
    return
end

% Compute the node's impurity I
for i = 1:length(Uc),
    Pnode(i) = length(find(targets == Uc(i))) / L;
end
Inode = -sum(Pnode.*log(Pnode)/log(2));

% For each dimension, compute the gain ratio impurity
delta_Ib = zeros(1, Ni);
P = zeros(length(Uc), Nbins);
for i = 1:Ni,
    for j = 1:length(Uc),
        for k = 1:Nbins,
            indices = find((targets == Uc(j)) & (features(i,:) == k));
            P(j,k)  = length(indices);
        end
    end
    Pk = sum(P);
    P  = P/L;
    Pk = Pk/sum(Pk);
    info = sum(-P.*log(eps+P)/log(2));
    delta_Ib(i) = (Inode - sum(Pk.*info)) / (-sum(Pk.*log(eps+Pk)/log(2)));
end

% Find the dimension maximizing delta_Ib
[m, dim] = max(delta_Ib);

% Split along the 'dim' dimension
tree.split_dim = dim;
dims = find(~ismember(1:Ni, dim));
for i = 1:Nbins,
    indices = find(features(dim, :) == i);
    tree.child(i) = make_tree(features(dims, indices), targets(indices), inc_node, Nbins);
end
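A hypothetical call, for orientation only: it uses toy data and assumes the companion helper functions referenced above (process_params, PCA, high_histogram) are on the MATLAB path.

% Hypothetical usage sketch with toy 2-D data
train_features = [randn(2,50), randn(2,50)+2];   % two Gaussian clusters
train_targets  = [zeros(1,50), ones(1,50)];      % class labels 0 and 1
params = [10, 5];                  % 10 bins per feature, 5% node impurity allowed
region = [-4 6 -4 6 100];          % [-x x -y y number_of_points]
D = ID3(train_features, train_targets, params, region);
imagesc(D); axis xy;               % visualize the decision surface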
[Data Mining] MATLAB Implementations of Data Mining Algorithms: C4_5 (online resources, essay)
Posted by 薛峰 on 2005/6/27 14:21:09
function D = C4_5(train_features, train_targets, inc_node, region)
% Classify using Quinlan's C4.5 algorithm
% Inputs:
%   train_features - Train features
%   train_targets  - Train targets
%   inc_node       - Percentage of incorrectly assigned samples at a node
%   region         - Decision region vector: [-x x -y y number_of_points]
% Outputs:
%   D              - Decision surface
%
% NOTE: In this implementation it is assumed that a feature vector with fewer
% than 10 unique values (the parameter Nu) is discrete, and will be treated as
% such. Other vectors will be treated as continuous.

[Ni, M]  = size(train_features);
inc_node = inc_node*M/100;
Nu       = 10;

% For the decision region
N      = region(5);
mx     = ones(N,1) * linspace(region(1), region(2), N);
my     = linspace(region(3), region(4), N)' * ones(1,N);
flatxy = [mx(:), my(:)]';

% Preprocessing (disabled)
%[f, t, UW, m]  = PCA(train_features, train_targets, Ni, region);
%train_features = UW * (train_features - m*ones(1,M));
%flatxy         = UW * (flatxy - m*ones(1,N^2));

% Find which of the input features are discrete, and discretize the
% corresponding dimension on the decision region
discrete_dim = zeros(1,Ni);
for i = 1:Ni,
    Nb = length(unique(train_features(i,:)));
    if (Nb <= Nu),
        % This is a discrete feature
        discrete_dim(i) = Nb;
        [H, flatxy(i,:)] = high_histogram(flatxy(i,:), Nb);
    end
end

% Build the tree recursively
disp('Building tree')
tree = make_tree(train_features, train_targets, inc_node, discrete_dim, max(discrete_dim), 0);

% Make the decision region according to the tree
disp('Building decision surface using the tree')
targets = use_tree(flatxy, 1:N^2, tree, discrete_dim, unique(train_targets));

D = reshape(targets, N, N);
% END
function targets = use_tree(features, indices, tree, discrete_dim, Uc)
% Classify recursively using a tree

targets = zeros(1, size(features,2));

if (tree.dim == 0)
    % Reached the end of the tree
    targets(indices) = tree.child;
    return
end

% This is not the last level of the tree, so:
% First, find the dimension we are to work on
dim  = tree.dim;
dims = 1:size(features,1);

% And classify according to it
if (discrete_dim(dim) == 0),
    % Continuous feature
    in = indices(find(features(dim, indices) <= tree.split_loc));
    targets = targets + use_tree(features(dims, :), in, tree.child(1), discrete_dim(dims), Uc);
    in = indices(find(features(dim, indices) >  tree.split_loc));
    targets = targets + use_tree(features(dims, :), in, tree.child(2), discrete_dim(dims), Uc);
else
    % Discrete feature
    Uf = unique(features(dim,:));
    for i = 1:length(Uf),
        in = indices(find(features(dim, indices) == Uf(i)));
        targets = targets + use_tree(features(dims, :), in, tree.child(i), discrete_dim(dims), Uc);
    end
end
% END use_tree
function tree = make_tree(features, targets, inc_node, discrete_dim, maxNbin, base)
% Build a tree recursively

[Ni, L] = size(features);
Uc      = unique(targets);
tree.dim = 0;
%tree.child(1:maxNbin) = zeros(1,maxNbin);
tree.split_loc = inf;

if isempty(features),
    return
end

% When to stop: if the dimension is one or the number of examples is small
if ((inc_node > L) | (L == 1) | (length(Uc) == 1)),
    H = hist(targets, length(Uc));
    [m, largest] = max(H);
    tree.child = Uc(largest);
    return
end

% Compute the node's impurity I
for i = 1:length(Uc),
    Pnode(i) = length(find(targets == Uc(i))) / L;
end
Inode = -sum(Pnode.*log(Pnode)/log(2));

% For each dimension, compute the gain ratio impurity
% This is done separately for discrete and continuous features
delta_Ib  = zeros(1, Ni);
split_loc =
(2,838 more characters follow; the post is truncated here)
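Since the post is cut off before make_tree is complete, the following is only a hypothetical sketch of the intended interface, using toy data; it would require the full C4_5.m to run.

% Hypothetical usage sketch with toy 2-D data
train_features = [randn(2,50), randn(2,50)+2];
train_targets  = [zeros(1,50), ones(1,50)];
inc_node = 5;                      % allow 5% incorrectly assigned samples at a node
region   = [-4 6 -4 6 100];        % [-x x -y y number_of_points]
D = C4_5(train_features, train_targets, inc_node, region);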
[Data Mining] Market Segmentation: A Key to Business Success (essay, reading notes, reflections)
Posted by 薛峰 on 2005/6/27 14:13:21
Through market research, a firm must divide the overall consumer market into a number of sub-markets whose members share broadly similar characteristics, based on consumers' differing wants and needs for a product and their differing buying behaviours and habits.
The Important Role of Market Segmentation
1. It helps a firm identify its target market. Whether the target market is chosen correctly directly determines the firm's subsequent development strategy and the "starting conditions" for its growth in the years that follow, so a firm must seek an ideal target market on the basis of thorough market segmentation. For example, 江门市时尚冷气公司 (Jiangmen Shishang Air-Conditioning Company) in Jiangmen, Guangdong, a company now specializing in air-conditioning products, was originally a general supply-and-marketing trading company dealing mainly in local produce and general merchandise. Because those goods carried little added value and the many competing sellers made margins ever thinner, the company's results deteriorated year by year and the business risked becoming unsustainable. In 1988 its management, after a careful analysis of the company's internal and external environment, decided to seize the opportunity presented by the then severe shortage of air conditioners: base the business in Jiangmen's urban area while reaching out to the prosperous Pearl River Delta, exploit the traditional supply-channel strengths of a supply-and-marketing enterprise to secure goods, and use the proximity to Hong Kong and Macau to deal directly in imported brand-name air conditioners, cutting out intermediaries and lowering distribution costs. Because the business direction was right and the target market well chosen, the company fully captured the target market's…
(4,580 more characters follow; the post is truncated here)